Exploiting MIPS embedded devices

Several security activities here at Emaze involve embedded devices, such as routers, access points, web cams, and so on. Today these devices are widespread, as several manufacturers or content providers push to distribute embedded hardware peripherals to their customer. Unfortunately, the security level of such platforms is often questionable. In this post we focus on a specific category of security flaws, memory errors, we discuss their impact on embedded platforms and we give an overview of possible exploitation techniques and countermeasures.

Most of the devices we typically deal with are powered by MIPS or ARM processors. In the following we discuss only the security-relevant aspects of the big-endian 32-bit MIPS architecture, but most of the material applies to the ARM platform as well. We assume the reader is confident with the exploitation of memory errors on the Intel x86 architecture.


A glimpse at the MIPS architecture
MIPS is a RISC instruction set architecture (ISA) often used in broadband routers, network gateways, and other embedded devices. MIPS CPU can be booted both in big- and little-endian configuration. There exist 32- and 64-bit versions of this ISA, but the former is certainly more common in embedded systems. As mentioned above, in this blog post we focus on big-endian 32-bit MIPS architecture. In the next paragraphs we introduce some details concerning the MIPS ISA that are relevant for the exploitation of memory errors. For a comprehensive reference on MIPS, interested readers can refer to this book.

MIPS ISA
As MIPS is a RISC ISA, instructions have fixed length: every MIPS instruction is 4-byte long. On MIPS, 32-bit memory accesses must be 4-byte aligned, as a consequence instructions must be placed at 4-byte aligned memory locations, otherwise a hardware exception is risen by the CPU. Another fundamental aspect of MIPS is its pipelining architecture, and the impact of the pipeline on instructions execution. From a programmer (and attacker) perspective, the most important consequence is that, when a branch instruction is reached, the instruction after the branch will be executed before the target of the branch is reached. The instruction position immediately after a branch is called branch delay slot.

Calling convention
When a function is called, the first four parameters are passed through registers $a0 through $a3, and the remaining parameters are pushed on the stack. When the callee terminates, it stores the return value in register $v0 (also $v1 can be used for a second return value, but this is not very common). As we will deal with memory errors, an important point is where a function return address is stored. Differently from x86 architectures, when a function is called the return address is not pushed on the stack, as the architecture reserves a special-purpose register for it, namely $ra. However, since there is just a single instance of this register, the previous value of $ra must be saved somewhere, otherwise nested function calls would not be possible. For this reason, the compiler arranges the prologue of non-leaf functions (i.e., functions that in turn call other procedures) to save $ra on the stack, in order to be able to restore its original value in the epilogue. As a consequence, only the return address for the current function is stored in $ra, while the others are still on the stack. Lets see a very simple example by examining the following C program source:
int foo(int a, int b, int c, int d, int e)
{
    return a+b;
}

int main(int argc, char **argv)
{
    return foo(1, 2, 3, 4, 5);
}
The main() procedure invokes foo(), passing five arguments. The disassembly for main()  is shown in the next figure. Registers $a0 through $a3 are used to store the first four arguments, while the last one is pushed on the stack (addresses 0x00400640-0x00400644). As main() is not a leaf function (it invokes foo()), $ra value is saved during the prologue and restored in the function epilogue.


The assembly code of foo(), depicted in the next image, is really simple, but it highlights the use of branch-delay slots: foo() computes the sum of the first two parameters, but the addu instruction is placed right after the “jump to $ra” statement; despite  addu follows the branch statements, the execution of the former is completed before the target of the branch is reached.


Stack smashing protection
Embedded devices are not as protected against exploitation attempts as traditional systems, but they are not free from defenses. In this section we give a brief overview of existing stack smashing protections for MIPS systems, including address space layout randomization (ASLR), execution prevention (W^X) and stack canaries (SSP, in GCC jargon). In the following we focus on MIPS device running Linux, as it is the most widely adopted operating system for this class of devices.

Address space layout randomization
Linux randomizes the stack layout of MIPS processes since 2.6.23. Starting from kernel version 2.6.26, also the heap and shared libraries were randomized. To the best of our knowledge, PIE executables are not so common on MIPS, and even shared libraries are not randomized very often.

"W^X" protection
Briefly, architectures that implement the "W^X" (Write XOR eXecute) protection allow to mark specific memory pages as executable but not writable. This kind of protection is usually enforced for those section that could store user-controlled data (e.g., the stack). On Intel processors, this policy is enforced through the NX page protection bit. At the time of writing, we are not aware of any MIPS processor with hardware support for the "W^X" protection. However, considering the rapid diffusion of MIPS devices, it is probably just a matter of time before also these CPUs start to include something similar to the Intel NX bit.

Stack canaries
Canary protection is named SSP (Stack-Smashing Protection) in current GCC versions, and, for the MIPS architecture, is implemented starting from GCC 4.5. Besides being supported by the compiler, the SSP protection must also be supported by run-time libraries (uClibc supports SSP since 2004).

The next figure demonstrates how GCC 4.5 implements SSP on MIPS. The implementation is very similar to that on the x86 platform. The canary is loaded from the global variable __stack_chk_guard and stored on the stack next to function local variables (0x004003f0). Then, during function epilogue, __stack_chk_guard is compared with the current value of the canary stored on the stack (0x00400430); if the two values differ, then execution is aborted by calling __stack_chk_fail.


Exploitation
Overall, exploitation of memory error vulnerabilities on embedded devices is almost the same as on traditional (i.e., x86) architectures. However, several subtle details must be taken into account in order to build effective and reliable exploits. First of all, it should be considered that, when we deal with embedded devices we must usually focus on remote exploits, as it is not often the case that attackers can leverage a local access to the device. In addition, attacks that rely on brute forcing attempts and make the target process to crash when the exploit fails are typically not suitable for embedded systems: these devices often run a big, single-thread binary file which provides most of the application-level functionalities; if this process crashes, the whole device becomes unusable.

Memory alignment
As we already mentioned, branch targets must be 4-byte aligned.
On a MIPS Linux system, processes which perform unaligned control transfers will be terminated through a SIGILL signal. As an example, when exploiting a vulnerability that allows to overwrite a code pointer (e.g., a saved return address), the new value must be aligned on a word boundary.

ROP and MIPS
"Return oriented programming" (ROP) is a popular exploitation technique, that can be considered as an evolution of the traditional "return-into-lib(c)" strategy, proposed by Solar Designer back in 1997. In a nutshell, return oriented programming consists into borrowing existing code chunks (gadgets) from the virtual address space of a target process; each code chunk terminates with a control transfer instruction (CTI), and the stack layout is prepared so that these CTIs chain multiple chunks together. In the last years, ROP has demonstrated to be effective against NX and partial ASLR, and it has been also used as the basic building block for surgically precise exploitation techniques.

Obviously, ROP also applies to MIPS; however, two characteristics of the architecture complicate the exploitation. First, MIPS instructions are 4-byte long, and, due to memory alignment checks, it is not possible to jump in the middle of existing instructions. Secondly, branch delay slots must be kept in mind: even if a gadget terminates with a branch instruction, also the instruction next to the branch will be executed before control is transferred to the next code chunk. Overall, these complications may drastically reduce the number of useful code chunks available.

Cache coherence
MIPS CPUs have two separate caches: the instruction cache (I-cache) and the data cache (D-cache). As the names suggest, code and data are cached in the I-cache and D-cache, respectively. Once a cache is full, it is flushed, i.e., its content is written-back to main memory.
But why should caches affect exploitation? As an attack payload is typically handled by the application as data, it will be stored inside the D-cache. When the payload triggers the vulnerability and hijacks the application control flow, execution is transferred at the memory address of the shellcode. However, if the D-cache has not been flushed yet, shellcode would be still stored inside the D-cache, and not into main memory. As a consequence, the target application would execute whatever is located at the memory location where the attacker intended to store the shellcode, with unexpected consequences.

Attackers leverage on two different solutions to overcome this problem. The first solution consists into filling the D-cache to force the CPU to write-back the shellcode into main memory. This approach is very simple to implement, but it is not free from limitations: as MIPS caches can store several kilobytes of data,  the drawback of this approach is that the attacker must be able to inject a large amount of data into the address space of the target application in order to fill the D-cache. Unfortunately, this is not often the case.

A more effective and reliable solution consists in forcing the CPU to flush its caches. If the target system runs Linux, attackers can leverage the cacheflush system call to flush both the I- and D-caches. Exploitation is even easier if the vulnerable program is linked to the uClibc library, as it provides a cacheflush() function that can be invoked through ROP. 

Shellcode
Writing a MIPS shellcode is not as easy as one would expect: all instructions are 4-byte long, and is quite difficult to build a NULL-free sequence of instructions that spawns a shell. A simpler solution is to leverage an encoder to decode the core of the shellcode at run-time, such as the MIPS XOR encoder included in Metasploit. However, the Metasploit framework includes just a single MIPS shellcode (a TCP reverse shell). As a side note, it is worth considering that the Metasploit MIPS shellcode connects-back to a hardcoded IP address; we recently submitted to the developers a patch that addresses this issue.

Conclusions
A lot of research efforts have been put into the exploitation of memory errors, but most of these works focus on the x86 architecture, the most widely adopted computing platform. With the diffusion of embedded devices, researchers (and attackers) started to move their attention to these platforms. Today's embedded architectures introduce novel challenges for exploit developers, that are not found in desktop or server environments. Despite this fact, exploitation of memory error vulnerabilities is still possible, and is often made easier by the lack of proper software and hardware protection facilities.

2 comments:

  1. Might be useful for others who want to look into it in detail. The
    heap/lib randomization code got moved.

    See:
    commit 6f6c3c33c027f2c83d53e8562cd9daa73fe8108b
    Author: Ralf Baechle
    Date: Thu May 19 09:21:33 2011 +0100

    ReplyDelete
  2. Hi Roberto,

    Nice Article,thanks for sharing this information.Looking forward for more posts like this.













    Embedded Systems india

    ReplyDelete