“…Performing this operation using the CSR format is trivial, but it was observed that the maximum performance in Mflop/s sustained by a naïve implementation can reach only a small fraction of the machine peak performance [14]. To overcome this limit, several optimization techniques have been proposed, such as reordering [24,28,29,32], data compression [22,33], blocking [1,15,23,24,28,29,31], vectorization [4,11], loop unrolling [32] and jamming [21], and software prefetching [29]. Lately, the dissemination of multi-core computers has promoted multi-threading as an important tuning technique, which can be further combined with purely sequential methods.…”
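
For orientation, here is a minimal sketch of what a naïve CSR kernel might look like. It assumes that the operation referred to in the excerpt is the sparse matrix-vector product y = Ax, which the excerpt itself does not state. The function name `csr_matvec`, the array names `values`, `col_idx`, and `row_ptr`, and the small example matrix are illustrative choices, not taken from the source.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Naive y = A @ x for a matrix A stored in CSR form (assumed operation).

    values  : nonzero entries of A, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Walk the nonzeros of row i; note the indirect access x[col_idx[k]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 example:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_matvec(values, col_idx, row_ptr, x)
print(y)
```

The inner loop does one multiply-add per stored nonzero and reads `x` through an index array. The techniques listed in the excerpt (reordering, blocking, prefetching, and so on) are optimizations applied to a kernel of this kind.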