Kasap, Server and Wächter, Eduardo Weber and Zhai, Xiaojun and Ehsan, Shoaib and McDonald-Maier, Klaus D (2021) Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery. Microelectronics Reliability, 124. p. 114297. DOI https://doi.org/10.1016/j.microrel.2021.114297
Abstract
All-Programmable System-on-Chips (APSoCs) are a compelling option for deploying applications in radiation environments thanks to their high computing performance and power efficiency. Despite these advantages, APSoCs are sensitive to radiation like any other electronic device. Processors embedded in APSoCs therefore have to be adequately hardened against ionizing radiation to be a viable design choice for harsh environments. This paper proposes a novel lockstep-based approach to harden the dual-core ARM Cortex-A9 processor in the Xilinx Zynq-7000 APSoC against radiation-induced soft errors by coupling it with a MicroBlaze TMR subsystem in the programmable logic (PL) layer of the Zynq. The proposed technique uses checkpointing along with roll-back and roll-forward mechanisms at the software level (i.e. software redundancy), as well as processor replication and checker circuits at the hardware level (i.e. hardware redundancy). Results of fault injection experiments show that the proposed approach achieves a high level of protection against soft errors, mitigating around 98% of bit-flips injected into the register files of both ARM cores while keeping the timing performance overhead as low as 25% if block and application sizes are adjusted appropriately. Furthermore, incorporating the roll-forward recovery operation in addition to the roll-back operation improves the Mean Workload between Failures (MWBF) of the system by up to ≈19%, depending on the nature of the running application. When a fault occurs, the application can proceed faster if it is treated with the roll-forward operation rather than the roll-back operation, so relatively more data can be processed before the next error occurs in the system.
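The difference between the two recovery operations named in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's implementation: the function names, the single-register "register file", the block workload, and especially the way the faulty core is identified (a reference recomputation standing in for whatever diagnosis the real checker performs) are all hypothetical. The run count also ignores the cost of that diagnosis step.

```python
def process_block(regs, block):
    """Toy workload (hypothetical): accumulate a block of data into register 0."""
    out = list(regs)
    out[0] = (out[0] + sum(block)) & 0xFFFFFFFF
    return out


def run_lockstep(blocks, fault=None, roll_forward=True):
    """Run two simulated cores in lockstep with a checkpoint after each block.

    fault = (block_index, core_index, bit) injects one transient bit-flip
    into register 0 of the chosen core after it has processed that block.
    Returns (result, lockstep_runs, recoveries).
    """
    checkpoint = [0]          # last state on which both cores agreed
    lockstep_runs = 0         # number of times a block ran on both cores
    recoveries = []
    pending = fault
    i = 0
    while i < len(blocks):
        # Both cores start from the same checkpoint and process the same block.
        cores = [process_block(checkpoint, blocks[i]) for _ in range(2)]
        lockstep_runs += 1

        if pending is not None and pending[0] == i:
            _, core, bit = pending
            cores[core][0] ^= 1 << bit   # simulated soft error
            pending = None               # transient: it does not recur

        if cores[0] == cores[1]:         # checker: states agree, commit
            checkpoint = cores[0]
            i += 1
        elif roll_forward:
            # ASSUMPTION: a reference recomputation stands in for the
            # diagnosis that identifies the fault-free core.
            reference = process_block(checkpoint, blocks[i])
            good = cores[0] if cores[0] == reference else cores[1]
            checkpoint = list(good)      # copy the good state, keep going
            recoveries.append("roll-forward")
            i += 1
        else:
            # Roll-back: discard both results; the block is re-executed
            # from the unchanged checkpoint on the next loop iteration.
            recoveries.append("roll-back")
    return checkpoint[0], lockstep_runs, recoveries


blocks = [[1, 2], [3, 4], [5, 6]]
fault = (1, 0, 3)  # flip bit 3 of core 0's register while handling block 1
print(run_lockstep(blocks))
print(run_lockstep(blocks, fault, roll_forward=True))
print(run_lockstep(blocks, fault, roll_forward=False))
```

In this model both recovery paths reach the same result, but roll-back needs one extra lockstep run of the affected block, whereas roll-forward continues from the fault-free core's state. That is the intuition behind the abstract's remark that more data can be processed before the next error when roll-forward is available.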
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Lockstep; Reliability; Fault tolerance; Soft error mitigation; Zynq APSoC; ARM Cortex-A processor; MicroBlaze processor |
| Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
| SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
| Depositing User: | Unnamed user with email elements@essex.ac.uk |
| Date Deposited: | 13 Aug 2021 20:12 |
| Last Modified: | 30 Oct 2024 16:24 |
| URI: | http://repository.essex.ac.uk/id/eprint/30895 |
Available files
Filename: 1-s2.0-S0026271421002638-main.pdf
Licence: Creative Commons: Attribution 3.0