Wachter, Eduardo and Fochi, Vinicius and Barreto, Francisco and Amory, Alexandre and Moraes, Fernando (2018) A Hierarchical and Distributed Fault Tolerant Proposal for NoC-Based MPSoCs. IEEE Transactions on Emerging Topics in Computing, 6 (4). pp. 524-537. DOI https://doi.org/10.1109/tetc.2016.2593640
Abstract
Aggressive scaling of CMOS process technology allows the fabrication of highly integrated chips such as NoC-based MPSoCs. However, fault probability increases as device sizes shrink. Hence, fault-tolerant design plays an important role in current nanometric technologies, leading to research on fault mitigation techniques for NoC-based MPSoCs. Most state-of-the-art papers present partial solutions for designing a fault-tolerant MPSoC, i.e., they present fault-tolerant mechanisms for either NoCs or processing elements (PEs). The goal of this paper is to propose a comprehensive integration of previously defined recovery mechanisms. The main novelty is the system-level integration itself, which is organized in a hierarchical and distributed manner, ensuring the correct execution of applications in the presence of multiple transient or permanent faults in the NoC, the PEs, or both. The combination of NoC and PE recovery methods enables the proposed system to tolerate a large number of faults. Depending on the severity of a fault in the NoC, the NoC may operate in degraded mode or require a search for fault-free paths. In both cases, communication is reestablished in less than 50 microseconds. Faults detected in the PEs trigger a lightweight and fast task relocation protocol, which executes in less than one millisecond.
Item Type: | Article |
---|---|
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Divisions: | Faculty of Science and Health; Faculty of Science and Health > Computer Science and Electronic Engineering, School of |
SWORD Depositor: | Unnamed user with email elements@essex.ac.uk |
Depositing User: | Unnamed user with email elements@essex.ac.uk |
Date Deposited: | 19 Mar 2019 16:43 |
Last Modified: | 06 Jan 2022 13:58 |
URI: | http://repository.essex.ac.uk/id/eprint/24083 |