switching execution to the other thread. The overall execution time is $t_3$, where $(t_1 + t_2) \ge t_3 > \max\{t_1, t_2\}$. However, since the two threads compete for locations in the cache, one thread may evict the instructions cached by the other, and the real threaded execution turns out to be the one shown in Figure 1(d). As can be seen, the number of cache misses, and hence the thread switching frequency, is actually increased, as is the execution time ($t_4 > t_3$). Cache misses lead to memory accesses, and since memory accesses are power consuming, significant power is wasted. In addition, each thread switch incurs overhead, which further degrades the design.

In this paper, we investigate micro-architectural solutions that make the cache behave in harmony with the threaded execution in the pipeline. Since instruction cache misses, owing to frequent instruction fetches, have a considerably higher performance impact than data cache misses [1], we focus this study on the instruction cache. We target embarrassingly parallel applications, in which the same code is executed by a number of independent threads on different data sets. Such applications can be found in real-world computing problems such as encryption [2], scientific calculation [3], multimedia processing [4] and image processing [5]. These large computing problems demand multiprocessor systems built from small building-block processors like the one discussed in this paper, as illustrated in Figure 2. Such embedded processors are usually resource-constrained to reduce energy and area costs. We focus on a multithreaded processor with a single pipeline and a small cache that processes lightweight applications, and we present a thread synchronization approach that synchronizes thread execution
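To make the targeted workload class concrete, the following C sketch runs the same kernel in several independent POSIX threads, each over its own slice of data; the kernel body, thread count and data sizes are hypothetical placeholders chosen for illustration, not the benchmarks or the processor studied in this paper.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define CHUNK 256

    static int  data[NUM_THREADS][CHUNK]; /* each thread has its own data set  */
    static long sums[NUM_THREADS];        /* per-thread results, nothing shared */

    /* The same code runs in every thread; only the data slice differs. */
    static void *kernel(void *arg)
    {
        long id  = (long)arg;
        long acc = 0;
        for (int i = 0; i < CHUNK; i++)
            acc += (long)data[id][i] * data[id][i]; /* placeholder computation */
        sums[id] = acc;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, kernel, (void *)t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
        for (int t = 0; t < NUM_THREADS; t++)
            printf("thread %d: %ld\n", t, sums[t]);
        return 0;
    }

Because the threads execute identical instructions, they could in principle share the same cached code; the cache interference of Figure 1(d) arises precisely when such threads instead evict one another's instructions.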