Multi-GPU based on multicriteria optimization for motion estimation system

García, Carlos; Botella, Guillermo; Ayuso, Fermin; Prieto, Manuel; Tirado, Francisco

doi:10.1186/1687-6180-2013-23

Cited by 18 publications

(15 citation statements)

References 26 publications

“…This shortcoming seriously hampers the practical application of this technology as an help for diagnostic which could be done by dermatologists or even general practitioners. Fortunately, previous researches have shown that GA-based designs can be optimized via different parallel computing environments [2,13,14,36]. For KMGA, the huge quantity of data from multispectral images processing and GA's population information results in a bottleneck of memory consumption on GPUs and FPGAs, while multi-core CPUs have better overall properties, specially for efficiency and robustness performances.…”

Section: Introduction (mentioning)

confidence: 99%

Design and evaluation of a parallel and optimized light–tissue interaction-based method for fast skin lesion assessment

Brost, Benezeth, et al. 2015, J Real-Time Image Proc

In recent years, image processing techniques have attracted much attention as powerful tools for assessing skin lesions from multispectral images. The Kubelka-Munk Genetic Algorithm (KMGA) is a novel method developed for this diagnostic purpose. It combines the Kubelka-Munk light-tissue interaction model with the Genetic Algorithm optimization process, and allows quantitative measurement of cutaneous tissue by computing skin parameter maps such as melanin concentration, blood volume fraction, oxygen saturation, or epidermis/dermis thickness. However, its efficiency is seriously reduced by the massive number of floating-point operations required for each pixel of the multispectral image, and this prevents the algorithm from reaching industrial standards of cost, power, and speed for clinical applications. In this paper, our work focuses on improving this theoretical achievement. We therefore propose a new C-based Parallel and Optimized KMGA (PO-KMGA) technique, designed and optimized in several ways: an optimized rewriting of the KM model, massive parallelization of operations using POSIX threads, memory-use optimization, and routine pipelining with the Intel C++ Compiler, among others. Intensive experiments demonstrate that the PO-KMGA framework takes less than 10 min to finish a job that takes the conventional KMGA around two days in the same hardware environment, with similar algorithm performance.

“…Other approaches to reducing the search parameter space include techniques like genetic algorithms (Garcia et al 2013), simulated annealing, or even a simple random search, which has been shown to be surprisingly effective in studies of dense computational codes on CPUs (Seymour et al 2008). The savings can be dramatic: a study by Ganapathi et al (2009) that used machine learning techniques to reduce a parameter search space yielded a speed-up of 2000× over the full search, while achieving tuning results comparable to a human expert.…”

Section: Autotuning (mentioning)

confidence: 99%

GPU-accelerated denoising of 3D magnetic resonance images

Howison, Bethel 2014, J Real-Time Image Proc

The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: What parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean-squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turnaround times.
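The citation statement above mentions random search as a way to cut down a tuning parameter space. A minimal sketch of that idea follows, assuming a caller-supplied cost function; the names `random_search`, `space`, and `cost` are hypothetical and are not taken from either paper. In a real autotuning study the cost would be a measured kernel run time for a given memory tiling.

```python
import random


def random_search(cost, space, trials, seed=0):
    """Sample `trials` random configurations and keep the cheapest one.

    cost   : callable mapping a configuration dict to a number (lower is better)
    space  : dict mapping each parameter name to a list of candidate values
    trials : number of random configurations to evaluate
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    best_cfg, best_cost = None, float("inf")
    for _ in range(trials):
        # Draw one value per parameter, independently and uniformly.
        cfg = {name: rng.choice(values) for name, values in space.items()}
        c = cost(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost


# Hypothetical tiling space and a stand-in cost; a real study would time a kernel.
tile_space = {"tile_x": [8, 16, 32, 64], "tile_y": [1, 2, 4, 8]}


def stand_in_cost(cfg):
    return abs(cfg["tile_x"] - 32) + abs(cfg["tile_y"] - 4)


best_cfg, best_cost = random_search(stand_in_cost, tile_space, trials=50)
```

Random search evaluates only `trials` configurations rather than every combination, which is the saving the citation statement refers to; it gives no guarantee of finding the true optimum.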

“…There are several previous works regarding motion estimation hardware acceleration [9][10][11] and, specifically, block-matching algorithms [12], though none of them explore the custom instruction paradigm. Looking into block-matching techniques, three frequently used techniques can be classified: the full-search technique (FST) [4], the three-step-search technique (TSST) [13], and the two-dimensional logarithmic-search technique (2DLOG) [14].…”

Section: Introduction (mentioning)

confidence: 99%

Acceleration of block-matching algorithms using a custom instruction-based paradigm on a Nios II microprocessor

González, Botella, García, et al. 2013, EURASIP J. Adv. Signal Process. (self-citation)

This contribution focuses on the optimization of matching-based motion estimation algorithms, widely used in video coding standards, using an Altera custom instruction-based paradigm and a combination of synchronous dynamic random access memory (SDRAM) with on-chip memory in Nios II processors. A complete profile of the algorithms is obtained before the optimization to locate code leaks; a custom instruction set is then created and added to the specific design, enhancing the original system. In addition, every possible memory combination between on-chip memory and SDRAM has been tested to achieve the best performance. The final throughput of the complete designs is shown. This manuscript outlines a low-cost system, mapped using very large scale integration technology, which accelerates software algorithms by converting them into custom hardware logic blocks, and shows the best combination of on-chip memory and SDRAM for the Nios II processor.
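For readers unfamiliar with the full-search technique (FST) named in the citation statement above, the following is a minimal software sketch. It assumes the sum of absolute differences (SAD) as the matching cost and plain nested lists as frames; the function names, block size, and search radius are illustrative assumptions, not the cited hardware design.

```python
def sad(ref, cur, ref_y, ref_x, cur_y, cur_x, block):
    """Sum of absolute differences between two block x block regions."""
    return sum(
        abs(ref[ref_y + i][ref_x + j] - cur[cur_y + i][cur_x + j])
        for i in range(block)
        for j in range(block)
    )


def full_search(ref, cur, top, left, block=4, radius=2):
    """Exhaustively test every displacement within +/- radius pixels.

    Returns (cost, dy, dx) for the reference-frame block that best matches
    the current-frame block whose top-left corner is (top, left).
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ref_y, ref_x = top + dy, left + dx
            # Skip candidates that fall partly outside the reference frame.
            if ref_y < 0 or ref_x < 0:
                continue
            if ref_y + block > height or ref_x + block > width:
                continue
            cost = sad(ref, cur, ref_y, ref_x, top, left, block)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best
```

Full search evaluates all (2 × radius + 1)² candidate displacements per block. The three-step and two-dimensional logarithmic searches mentioned in the same statement reduce that count by testing only a subset of candidates.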
