Compiler Assisted Reliability Optimizations
Abstract
Microprocessors are used in an expanding range of applications, from small embedded devices to supercomputers and mainframes. Moreover, embedded microprocessor-based systems have become essential in modern societies. Depending on the application domain, embedded systems have to satisfy different constraints. The major challenges today are cost, performance, energy consumption, reliability, real-time (reactive) operation, and silicon area. In traditional computer systems some of these constraints can be less crucial than others, while performance, area, and power consumption always remain valid constraints for embedded systems. In modern systems, however, reliability has emerged as a new, highly important requirement. Among the above factors, performance, power, reactive operation, and reliability can be addressed by software-only techniques that do not require any hardware modifications or additions. Such techniques, however, may affect the performance and power characteristics of the system. The main goal of this work is to find novel software-based reliability techniques with affordable power and performance overheads. To this end, reliability optimization methods are studied in detail and a careful categorization of existing software techniques is proposed. The strengths and weaknesses of each category are examined. Using the information obtained from this categorization, two novel optimization techniques for fault detection and one new optimization technique for fault recovery are proposed. These techniques minimize the number of required code instrumentation points while guaranteeing reliability equivalent to that of state-of-the-art approaches. Moreover, a generic methodology is proposed to help identify the minimum set of code instrumentation points.
For the evaluation, we select a challenging baseline that consists of the best known techniques for fault detection and fault recovery found in the public literature. The experimental results on a set of biomedical benchmarks show that, with the proposed design methodology and the fault detection and recovery methods, the performance and power overheads are significantly reduced while the fault coverage remains in line with previously proposed and widely used methods.