Abstract: Concurrent on-line testing (COLT) of many-core systems-on-chip (SoC) has recently been proposed in response to the growing threat that electronic wear-out poses to system operational lifetimes and to the increasing reliability and availability demands of safety-critical applications. Previous research in concurrent on-line testing has focused on centralized approaches that manage core testing while the system remains available to execute normal user applications. However, as technology scaling allows dozens or hundreds of processing cores to be placed on a single chip, these centralized approaches do not scale. In this paper, a distributed concurrent on-line test scheduling protocol is proposed and evaluated against previously developed solutions. Our experiments show that a distributed COLT scheduler can test a moderately-sized SoC with a speedup of 3.85 over centralized approaches while consuming 84% less energy, and that these benefits improve as the number of cores per chip increases. This research also presents a core test ordering algorithm, Code-Division Core Test Scheduling, that provides an additional 40% reduction in system test latency compared to other schedulers.
I. INTRODUCTION

Rapid technology scaling has forced system designers, reliability engineers and application programmers to rethink the fundamental design practices that have dominated computer system design for more than two decades. Multi-core systems-on-chip (SoC), with a handful of complex processing cores and integrated peripheral components, are predicted to be replaced by many-core SoC that contain hundreds or thousands of lightweight processing cores, memory and I/O subsystems. These many-core SoC will use packet-switched networks-on-chip (NoC) for inter-core communication, as opposed to the current standard of on-chip buses [1,7]. Notable examples of this architecture include the 64-core TILE64 from TILERA [24] and the 80-core Intel Terascale SoC [11].

As technology scaling has provided new opportunities for massively parallel and distributed computation to be performed on a single chip, new reliability challenges have also emerged. In addition to the well-understood circuit failures due to manufacturing imperfections, SoC components are also more susceptible to electronic wear-out (permanent failures that emerge during use) as feature sizes scale below 65nm [2,6,7].

Electronic wear-out is in fact a combination of several physical degradation mechanisms, including electro-migration (EM), hot carrier injection (HCI) and negative bias temperature instability (NBTI), that are intensified by smaller feature sizes, higher current and power densities, and higher operating temperatures [2].

Because the most significant electronic wear-out mechanisms manifest as increasingly severe delay faults at the circuit level, many researchers have proposed the use of SCAN-based delay testing for detecting this type of error [4,14,15]. Built-in self-test (BIST) architectures using pseudo-randomly generated test v...