OLAP (On-Line Analytical Processing) is a powerful method for analyzing the excessive amount of data related to business intelligence applications. OLAP utilizes the efficient multidimensional data structure referred to as the OLAP cube to answer multi-faceted analytical queries. As queries become more complex and the dimensionality and size of the cube grows, the processing time required to aggregate queries increases.In this paper, we are proposing: (1) a parallel implementation of MOLAP cube using OpenMP; (2) a text-tointeger translation method to allow effective string processing on GPU; and (3) a new scheduling algorithm that support these new features.To be able to process string queries on the GPU, we are introducing a text-to-integer translation method which works with multiple dictionaries. The translation is necessary only for the GPU side of the system. To support the translation and parallel CPU implementation, a new scheduling algorithm is proposed. The scheduler divides multi-core processor(s) of a shared memory system into a processing partition and a preprocessing (or translation) partition.The performance of the new system is evaluated. The textto-integer translation adds a new vital functionality to our system; however it also slows down the GPU processing by 7% when compare to original implementation without string support.The performance measurements indicate that due to the parallel implementation, the processing rate of the CPU partition improves from 12 to 110 queries per second. Moreover, the CPU partition is now able to process OLAP cubes of size 32 GB at rate of 11 queries per second. The total performance of the entire hybrid system (CPU + GPU) increased from 102 to 228 queries per second.
Deep neural networks have become the readiest answer to a range of application challenges including image recognition, stock analysis, natural language processing, and biomedical applications such as seizure detection. All while outperforming prior leading solutions that relied heavily on hand-engineered techniques. However, deployment of these neural networks often requires high-computational and memory-intensive solutions. These requirements make it challenging to deploy Deep Neural Networks (DNNs) in embedded, real-time low-power applications where classic architectures, GPUs and CPUs, still impose significant power burden. Systems-on-Chip (SoC) with Field-programmable Gate Arrays (FPGAs) can be used to improve performance and allow more fine-grain control of resources than CPUs or GPUs, but it is difficult to find the optimal balance between hardware and software to improve DNN efficiency. In the current research literature there have been few proposed solutions to address optimizing hardware and software deployments of DNNs in embedded low-power systems. To address the computation resource restriction and low-power needs for deploying these networks, we describe and implement a domain-specific metric model for optimizing task deployment on differing platforms, hardware and software. Next, we propose a DNN hardware accelerator called Scalable Low-power Accelerator for real-time deep neural Networks (SCALENet) that includes multithreaded software workers. Finally, we propose a heterogeneous aware scheduler that uses the DNN-specific metric models and the SCALENet accelerator to allocate a task to a resource based on solving a numerical cost for a series of domain objectives. To demonstrate the applicability of our contribution, we deploy nine modern deep network architectures, each containing a different number of parameters within the context of two different neural network applications: image processing and biomedical seizure detection. Utilizing the metric modeling techniques integrated into the heterogeneous aware scheduler and the SCALENet accelerator, we demonstrate the ability to meet computational requirements, adapt to multiple architectures, and lower power by providing an optimized task to resource allocation. Our heterogeneous aware scheduler improves power saving by decreasing power consumption by 10% of the total system power, does not affect the accuracy of the networks, and still meets the real-time deadlines. We demonstrate the ability to achieve parity with or exceed the energy efficiency of NVIDIA GPUs when evaluated against Jetson TK1 with embedded GPU SoC and with a 4× power savings in a power envelope of 2.0W. When compared to existing FPGA-based accelerators, SCALENet’s accelerator and heterogeneous aware scheduler achieves a 4× improvement in energy efficiency.
Big data processing on hardware gained immense interest among the hardware research community to take advantage of fast processing and reconfigurability. Though the computation latency can be reduced using hardware, big data processing cost is dominated by data transfers. In this article, we propose a low overhead framework based on compressive sensing (CS) to reduce data transfers up to 67% without affecting signal quality. CS has two important kernels: “sensing” and “reconstruction.” In this article, we focus on CS reconstruction is using orthogonal matching pursuit (OMP) algorithm. We implement the OMP CS reconstruction algorithm on a domain-specific PENC many-core platform and a low-power Jetson TK1 platform consisting of an ARM CPU and a K1 GPU. Detailed performance analysis of OMP algorithm on each platform suggests that the PENC many-core platform has 15× and 18× less energy consumption and 16× and 8× faster reconstruction time as compared to the low-power ARM CPU and K1 GPU, respectively. Furthermore, we implement the proposed CS-based framework on heterogeneous architecture, in which the PENC many-core architecture is used as an “accelerator” and processing is performed on the ARM CPU platform. For demonstration, we integrate the proposed CS-based framework with a hadoop MapReduce platform for a face detection application. The results show that the proposed CS-based framework with the PENC many-core as an accelerator achieves a 26.15% data storage/transfer reduction, with an execution time and energy consumption overhead of 3.7% and 0.002%, respectively, for 5,000 image transfers. Compared to the CS-based framework implementation on the low-power Jetson TK1 ARM CPU+GPU platform, the PENC many-core implementation is 2.3× faster for the image reconstruction part, while achieving 29% higher performance and 34% better energy efficiency for the complete face detection application on the Hadoop MapReduce platform.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.