Research Repository

Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery

Kasap, Server and Wächter, Eduardo Weber and Zhai, Xiaojun and Ehsan, Shoaib and McDonald-Maier, Klaus D (2021) 'Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery.' Microelectronics Reliability, 124. p. 114297. ISSN 0026-2714

1-s2.0-S0026271421002638-main.pdf - Published Version
Available under License Creative Commons Attribution.

All-Programmable System-on-Chips (APSoCs) are a compelling option for deploying applications in radiation environments thanks to their high-performance computing and power efficiency. Despite these advantages, APSoCs are sensitive to radiation like any other electronic device. Processors embedded in APSoCs therefore have to be adequately hardened against ionizing radiation to make them a viable design choice for harsh environments. This paper proposes a novel lockstep-based approach to harden the dual-core ARM Cortex-A9 processor in the Xilinx Zynq-7000 APSoC against radiation-induced soft errors by coupling it with a MicroBlaze TMR subsystem in the programmable logic (PL) layer of the Zynq. The proposed technique uses checkpointing along with roll-back and roll-forward mechanisms at the software level (i.e. software redundancy), as well as processor replication and checker circuits at the hardware level (i.e. hardware redundancy). Results of fault injection experiments show that the proposed approach achieves high levels of protection against soft errors by mitigating around 98% of bit-flips injected into the register files of both ARM cores, while keeping timing performance overhead as low as 25% if block and application sizes are adjusted appropriately. Furthermore, incorporating the roll-forward recovery operation in addition to the roll-back operation improves the Mean Workload Between Failures (MWBF) of the system by up to ≈19%, depending on the nature of the running application: when a fault occurs, the application can proceed faster if the fault is handled with roll-forward rather than roll-back. Thus, relatively more data can be processed before the next error occurs in the system.
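
As a rough illustration of the difference between roll-back and roll-forward recovery mentioned in the abstract, the Python sketch below simulates two lockstepped cores that execute an application block by block and compare results at each checkpoint. It is not the authors' implementation. The function name `run_lockstep`, the use of a single integer as the processor state, the fault table, and in particular the assumption that the checker can tell which core was hit (here the simulation simply knows) are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, List, Tuple

Block = Callable[[int], int]  # one block of work: old state -> new state


def run_lockstep(blocks: List[Block],
                 faults: Dict[int, Tuple[int, int]],
                 roll_forward: bool) -> Tuple[int, int]:
    """Toy lockstep simulation (hypothetical, for illustration only).

    faults maps a block index to (core_id, bit): a single transient
    bit-flip injected into that core's result the first time the block runs.
    Returns (final_state, number_of_block_executions_per_core).
    """
    checkpoint = 0           # last state on which both cores agreed
    executions = 0
    pending = dict(faults)   # transient faults: each one strikes only once
    i = 0
    while i < len(blocks):
        # Both cores execute the same block from the same checkpoint.
        states = [blocks[i](checkpoint), blocks[i](checkpoint)]
        executions += 1

        faulty = None
        if i in pending:
            faulty, bit = pending.pop(i)
            states[faulty] ^= 1 << bit   # emulate a soft error

        if states[0] == states[1]:
            # Results agree: commit a new checkpoint and move on.
            checkpoint = states[0]
            i += 1
        elif roll_forward:
            # Assumption: the checker can identify the faulty core, so the
            # healthy core's result is committed without re-executing.
            checkpoint = states[1 - faulty]
            i += 1
        # Otherwise roll back: discard both results; the loop re-runs
        # block i from the unchanged checkpoint.
    return checkpoint, executions


blocks = [lambda s: s + 1, lambda s: s * 3, lambda s: s + 5]
faults = {1: (0, 2)}  # flip bit 2 of core 0 during block 1

print(run_lockstep(blocks, faults, roll_forward=False))  # (8, 4)
print(run_lockstep(blocks, faults, roll_forward=True))   # (8, 3)
```

In this toy run both policies reach the same final state, but roll-back executes the faulty block a second time while roll-forward does not. That saving is the intuition behind the MWBF improvement the abstract reports, although the actual gain depends on the application and on the recovery costs of the real system.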

Item Type: Article
Uncontrolled Keywords: Lockstep; Reliability; Fault tolerance; Soft error mitigation; Zynq APSoC; ARM Cortex-A processor; MicroBlaze processor
Divisions: Faculty of Science and Health
Faculty of Science and Health > Computer Science and Electronic Engineering, School of
SWORD Depositor: Elements
Depositing User: Elements
Date Deposited: 13 Aug 2021 20:12
Last Modified: 18 Aug 2022 12:22
