Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific and engineering applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations and optimization techniques have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular one used for very large but sparse matrices. Post sparse-tensor decomposition, the new representation involves indirect accesses, making it challenging to optimize for multi-cores and even more demanding for massively parallel architectures such as GPUs. Computational neuroscience algorithms often involve sparse datasets while still performing long-running computations on them. The Linear Fascicle Evaluation (LiFE) application is a popular neuroscience algorithm used for pruning brain connectivity graphs. The datasets employed herein involve the Sparse Tucker Decomposition (STD), a widely used tensor decomposition method. Using this decomposition leads to multiple indirect array references, making it very difficult to optimize on both multi-core and many-core systems. Recent implementations of the LiFE algorithm show that its SpMV operations are the key bottleneck for performance and scaling. In this work, we first propose target-independent optimizations for these SpMV operations, followed by target-dependent optimizations for CPU and GPU systems. The target-independent techniques include: (1) standard compiler optimizations to prevent unnecessary and redundant computations, (2) data restructuring techniques to minimize the effects of indirect accesses, and (3) methods to partition computations among threads to obtain coarse-grained parallelism with low synchronization overhead.
Then we present the target-dependent optimizations for CPUs, namely: (1) efficient synchronization-free thread mapping, and (2) utilizing BLAS calls to exploit hardware-specific speed. Following that, we present various GPU-specific optimizations to optimally map threads at the granularity of warps, thread blocks, and the grid. Furthermore, to automate the CPU-based optimizations developed for this algorithm, we also extend the PolyMage domain-specific language, embedded in Python. Our highly optimized and parallelized CPU implementation obtains a speedup of 6.3× over the naive parallel CPU implementation running on a 16-core Intel Xeon Silver (Skylake-based) system. In addition, our optimized GPU implementation achieves a speedup of 5.2× over a reference optimized GPU code version on NVIDIA's GeForce RTX 2080 Ti GPU, and a speedup of 9.7× over our highly optimized and parallelized CPU implementation. We make the following novel contributions in this work. First, we generalize the data restructuring methods and computation splitting techniques, and extend them to CPUs. Second, we present CPU-specific optimizations to improve the performance of the LiFE application. Third, we describe DSL bas...