
Design of SoC for high reliability Embedded Processor systems

ABSTRACT

Reliability is vital for many embedded applications, including industrial controllers and automotive electronics. Many well-established techniques exist for creating reliable high-end control systems, and these techniques are now also appearing in small embedded systems, including many microcontroller products based on ARM® processors. This study gives an overview of the system design techniques and processor-specific features commonly used in such applications, including dual-core lock-step and the processor's internal memory protection (e.g. parity, ECC), as well as system-level enhancements such as bus-level timeout monitors and hardware monitoring units. It also examines the pitfalls an SoC designer needs to consider when designing embedded processor systems for high reliability applications such as space and automotive.

SCOPE

There is a wide range of embedded systems requiring high reliability. Traditionally, industrial controllers and automotive electronics are key areas of the electronics industry that demand high reliability. Today, however, embedded systems are also deployed in medical equipment, smart building management, and other areas that have higher reliability requirements than traditional consumer electronics such as home appliances and entertainment systems. Technology trends like the Internet of Things (IoT) are also driving demand for improved reliability. With IoT, there are new implications for a range of electronic product designs. For example, a home broadband router built into the IoT infrastructure allows a fire alarm and a security system to be linked with a homeowner's smartphone and local emergency services.

TYPICAL REQUIREMENTS OF HIGHLY RELIABLE EMBEDDED PROCESSOR SYSTEMS

Typical key technical requirements for highly reliable embedded systems include:

1) Reducing the possibility of errors.
2) Detection of errors.
3) Support for correction of errors.
4) Robustness: a single point of failure should not be able to manifest into a complete system failure.

TYPICAL CAUSES OF SYSTEM FAILURE

Memory: This is the most common concern for SoC designers. As memory density gets higher, the energy required to toggle a memory bit is reduced, so accidental changes of memory state can easily be triggered by various events: a transient pulse on a power wire or signal connection, electrostatic discharge, a hit by a radiation particle, or even interference from a nearby RF transmitter.

Logic: This is a less common cause, as modern chip design tools provide very good signal integrity checking and most hardware failures can be detected by manufacturing tests (e.g. a scan test). However, logic failures can still be caused by electromigration and by transient pulses in a power supply or signal connections.
Software: This is possibly the most common cause of system failure in finished products. Common mistakes such as inadequate checking of external inputs, incorrect sizing of the stack and heap, or simple programming bugs can all result in different types of system failure.

TIGHTLY COUPLED MEMORY FOR MEMORY ERROR HANDLING

To handle memory errors, ARM processors provide an error handling interface, typically seen on the Cortex-M7's tightly coupled memory (TCM) interface:

  • If an ECC error is detected, the ECC computation logic can signal to the processor that the read operation needs to be retried (a retry signal). If the error can be corrected, the corrected data is forwarded to the processor, and can also be written back to the SRAM at the same time.
  • If the ECC error cannot be corrected (a multi-bit error), then the ECC computation logic can send an error status back to the processor to indicate a fault (an error signal). In this case, a fault exception could be triggered on the Cortex-M7 processor and the error is dealt with inside the fault exception handler.

PROTECTION AGAINST HARDWARE LOGIC ERRORS

There are a number of ways to enhance the reliability of the logic hardware. Typically, in automotive applications, the logic cell libraries are characterized only for operating conditions with minimum risk of logic failure. As a result, the same design on the same semiconductor process may run slower (i.e. with a larger timing margin) but with higher reliability (i.e. a lower probability of logic failure). In some cases, chip designers can also use special radiation-hardened (rad-hard) processes for specialized IC design. The reliability of the system can be improved further by the following design practices:

  • Ensuring good floor planning and power rail design to reduce IR drop (internal power rail voltage drop).
  • Using EDA tools to ensure good signal integrity.
  • Enhancing the error detection features in the system.

In practice, logic errors are a much smaller contributor to overall system failures; their rate is typically much lower than that of memory failures.

LOGIC ERROR DETECTION WITH DUAL-CORE LOCK-STEP

To detect logic errors at run-time, some embedded systems deploy two instances of the same processor and compare their outputs as a way of detecting hardware errors. This is commonly known as dual-core lock-step. In such designs, the two processors start from the same state (reset) and receive the same inputs (executing the same program), so their outputs should be identical. If the outputs of the two processors mismatch, we know that a logic failure has occurred, and it can be handled by a separate mechanism (with only two cores, it is impossible to tell which one is correct). The Cortex-M7 processor supports a dual-core lock-step configuration option. In this configuration, the core logic is instantiated twice, but the cache memory and TCM interface are shared, because they can be protected by ECC at a lower silicon area overhead (figure 3). Since the two cores execute the same program, having two processor cores does not enhance system performance.

SOFTWARE ERRORS THAT AFFECT SYSTEM PERFORMANCE

There can be many causes of software-related failures. The majority are simply bugs in software components, such as inadequate validation of external inputs, poor software design that leads to race conditions, or in some cases a failure to reserve enough memory for the stack or heap. There are also failure conditions that a software designer could not have foreseen. For example, if an internet-connected device is hit by a DoS (Denial of Service) attack, or suddenly receives an unexpectedly high volume of input packets, the system could fail due to performance or memory size limitations. A significant performance margin or memory size margin may therefore be required.

CASE STUDY: DOS ATTACKS AS A SOURCE OF SOFTWARE ERRORS

A denial of service (DoS) attack can be defined simply as flooding the target microcontroller or computer with more traffic than it can handle, effectively overwhelming the system and crashing it. There are many variants of denial of service attacks; the most common types are:

1. Application layer attacks
2. Distributed denial of service (DDoS) attacks
3. Unintended denial of service attacks

CASE STUDY: A BUFFER OVERFLOW ATTACK

The most common type of attack seen on microcontroller systems is the buffer overflow attack. Consider the following program:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char buffer[500];
    strcpy(buffer, argv[1]);   /* unchecked copy of untrusted input */
    return 0;
}

The program shown is a simple program that accepts data from the command line and stores it into a buffer. The data entered on the command line is stored in the buffer, which is allocated on the stack along with the function's parameters and return address. A buffer overflow occurs when data written past the end of the buffer overwrites that return address, leading to problems.

To expose the vulnerability, compile the program with the compiler's built-in mitigation disabled, using "gcc -fno-stack-protector". With the program built, we can take it a bit further and see how it behaves.

The program allocates a 500-byte buffer for the command line argument. If we write past the allocated space, we overwrite the return address on the stack and the program crashes. Using a Python one-liner in the debugger, we can feed the buffer, say, 506 copies of the character 0x41 ('A'). This results in a segmentation fault: on inspection, the return address (0xb7004141) has been half overwritten by 0x41 characters, and the resulting address is not a valid memory address for this process and contains nothing. Increasing the input by 2 bytes to 508 and repeating the process overwrites the return address completely with 0x41414141; the program jumps to that address, where there is nothing to execute. If malicious code were located at the overwritten return address, there would be a serious problem.

Having shown that the return address can be overwritten, we can instead rewrite it to point to a payload. The payload places some values on the stack and executes a system call to launch a shell.
The shell code used for the buffer overflow exploit is:

"\x31\xc0\xb0\x46\x31\xdb\x31\xc9\xcd\x80\xeb\x16\x5b\x31\xc0\x88\x43\x07\x89\x5b\x08\x89\x43\x0c\xb0\x0b\x8d\x4b\x08\x8d\x53\x0c\xcd\x80\xe8\xe5\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68"

Here \xcd\x80 executes the system interrupt call that runs the shell. The shell code was generated using the tool msfvenom. Crucially, the shell code must contain no null characters and must be very small. By running this shell code and suitably adjusting the overwritten return address, an attacker can gain root access to the system.

PROTECTION AGAINST SOFTWARE ERRORS

1. Using the MPU: the Memory Protection Unit is an optional component, but it should be included in an embedded system to improve reliability. It is typically used in bare-metal designs and with an RTOS that supports the MPU.
2. Stack checking: this ensures that the application code does not try to access memory regions it should not.
3. Separating the application stack from the exception handler stack: Cortex-M processors have two stack pointers to support efficient OS operations. Even in embedded systems without an OS, this feature can be used so that the exception handler's stack is separate from the application code stack.
