The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across the machines of a large cluster. It is one of the most widely used distributed file systems and offers high availability and scalability on low-cost hardware. HDFS serves as the storage component of the Hadoop framework; coupled with MapReduce, the processing component, it has become the standard platform for big data management today. By design, however, HDFS handles huge numbers of large files well, while deployments that must handle large numbers of small files may not be efficient. This paper puts forward a new strategy for managing small files. The approach consists of two principal phases. The first phase consolidates the input files of several clients and stores them contiguously in a single allocated block, in SequenceFile format, continuing into the next blocks as needed. In this way we avoid allocating multiple blocks for different streams, which reduces requests for available blocks and reduces the metadata memory on the NameNode: a group of small files packaged in a SequenceFile on the same block requires one metadata entry instead of one per small file. The second phase analyzes the attributes of the stored small files and distributes them so that the most frequently accessed files are referenced by an additional index, in MapFile format, to improve read performance during random access.
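As a rough illustration of the first phase, the Java sketch below packs a directory of small files into one Hadoop SequenceFile, using each file name as the key and its raw bytes as the value. It is a minimal sketch, not the paper's actual implementation; the input directory and output path are assumptions for illustration.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/packed.seq");        // assumed target path on HDFS

        // Key = original file name, value = raw file content.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            File inputDir = new File("/tmp/small-files");     // assumed local input directory
            File[] smallFiles = inputDir.listFiles();
            if (smallFiles != null) {
                for (File f : smallFiles) {
                    byte[] content = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(content));
                }
            }
        } finally {
            writer.close();   // the packed group occupies the allocated block(s), one NameNode entry per block
        }
    }
}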
Hadoop, now considered the de facto platform for managing big data, has revolutionized the way organizations manage their data. As an open-source implementation of MapReduce, it was designed to offer high scalability and availability across clusters of thousands of machines. Through its two principal components, HDFS for distributed storage and MapReduce as the distributed processing engine, companies and research projects draw great benefit from its capabilities. However, Hadoop was designed to handle large files, and when faced with a large number of small files its performance can be heavily degraded. The small file problem has been well defined by researchers and the Hadoop community, but most of the proposed approaches only deal with the pressure placed on the NameNode memory. Grouping small files into the various formats supported by current Hadoop distributions certainly reduces the number of metadata entries and addresses the memory limitation, but that is only part of the equation. The real impact organizations need to address when dealing with many small files is cluster performance when those files are processed in Hadoop clusters. In this paper, we propose a new strategy to use one of the common solutions, grouping files in MapFile format, more efficiently. The core idea is to organize small files in MapFile output files based on specific attributes, and to use prefetching and caching mechanisms during read access. This leads to fewer metadata calls to the NameNode and better I/O performance during MapReduce jobs. The experimental results show that this approach can achieve better access times when the cluster contains a massive number of small files.
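The read path can be sketched as follows: indexed random access to small files stored in a MapFile, with a simple in-memory LRU cache standing in for the caching mechanism mentioned above. This is a hypothetical illustration under assumed paths, cache size, and class names, not the authors' code.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CachedMapFileReader {
    private final MapFile.Reader reader;

    // Small LRU cache: repeated reads of "hot" small files skip the MapFile lookup entirely.
    private final Map<String, byte[]> cache =
            new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    return size() > 1000;   // assumed cache capacity
                }
            };

    public CachedMapFileReader(Configuration conf, String mapFileDir) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        this.reader = new MapFile.Reader(fs, mapFileDir, conf);
    }

    // Returns the content of one small file identified by its name, or null if absent.
    public byte[] read(String fileName) throws IOException {
        byte[] cached = cache.get(fileName);
        if (cached != null) {
            return cached;                   // served from cache, no HDFS access
        }
        BytesWritable value = new BytesWritable();
        if (reader.get(new Text(fileName), value) == null) {
            return null;                     // not present in this MapFile
        }
        byte[] bytes = java.util.Arrays.copyOf(value.getBytes(), value.getLength());
        cache.put(fileName, bytes);
        return bytes;
    }

    public void close() throws IOException {
        reader.close();
    }
}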
Cloud computing is a model that facilitates on-demand access to and manipulation of resources. It is a technology uniquely placed today to meet the needs and demands of customers while guaranteeing a high quality of service. This model makes it possible to reorganize the current upheaval in the information technology industry by providing cost-effective solutions to the constraints of technical capacity and its extension. This article presents the technical implementation of a flexible "private-public" cloud, based on the OpenStack solution, to meet business needs in terms of performance, processing response time tailored to customer demand, and high availability. The proposed flexible "private-public" cloud connects two clouds: cloud A, which comprises the company's entire physical infrastructure, and cloud B, provided by a service provider and called upon only once a configurable load threshold is exceeded on cloud A. As soon as resources on the private cloud A are released, the instances migrated to cloud B are migrated back to cloud A in order to minimize allocation times.
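The threshold-based bursting policy described above can be sketched in Java as follows. The Cloud interface and its methods are illustrative placeholders (a real deployment would call the OpenStack compute APIs), and the scheduling loop is a simplified reading of the description, not the article's implementation.

import java.util.List;

public class BurstController {
    // Placeholder abstraction over a cloud; not an actual OpenStack SDK interface.
    interface Cloud {
        double cpuLoad();                         // aggregate CPU utilization, 0.0 to 1.0
        boolean hasFreeCapacity();
        void launchInstance(String instanceId);
        void migrateInstance(String instanceId, Cloud target);
        List<String> burstedInstances();          // instances currently running on this cloud after bursting
    }

    private final Cloud privateCloudA;
    private final Cloud publicCloudB;
    private final double threshold;               // configurable load threshold on cloud A

    BurstController(Cloud privateCloudA, Cloud publicCloudB, double threshold) {
        this.privateCloudA = privateCloudA;
        this.publicCloudB = publicCloudB;
        this.threshold = threshold;
    }

    // Called periodically: place new work on cloud A while it is below the threshold,
    // burst to the provider's cloud B otherwise, and bring bursted instances back
    // as soon as cloud A frees resources, to minimize allocation on cloud B.
    void schedule(String newInstanceId) {
        if (privateCloudA.cpuLoad() < threshold && privateCloudA.hasFreeCapacity()) {
            privateCloudA.launchInstance(newInstanceId);
        } else {
            publicCloudB.launchInstance(newInstanceId);        // burst to cloud B
        }
        if (privateCloudA.cpuLoad() < threshold) {
            for (String id : publicCloudB.burstedInstances()) {
                if (!privateCloudA.hasFreeCapacity()) {
                    break;
                }
                publicCloudB.migrateInstance(id, privateCloudA);   // re-migration to cloud A
            }
        }
    }
}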
HDFS is one of the most widely used distributed file systems, offering high availability and scalability on low-cost hardware. It is delivered as the storage component of the Hadoop framework. Coupled with MapReduce, the processing component, HDFS and MapReduce have become the de facto platform for managing big data. However, HDFS was designed specifically to handle a huge number of large files, and Hadoop deployments may not be efficient when faced with a large number of small files. In this paper, we propose a new strategy for managing small files. Our approach consists of two principal phases. The first phase consolidates the small-file inputs of more than one client and stores them contiguously in the first allocated block, in SequenceFile format, and so on into the next blocks. In this way we avoid multiple block allocations for different streams, reducing calls for available blocks and reducing the metadata memory on the NameNode, because a group of small files packaged in a SequenceFile on the same block requires one entry instead of one per small file. The second phase analyzes the attributes of the stored small files and distributes them so that the most frequently accessed files are referenced by an additional index, in MapFile format, to improve read performance during random access.
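A minimal sketch of the second phase, under the assumption that the small files' contents are already available in memory (for example, read back from the SequenceFile of phase one): the entries are rewritten in sorted order as a Hadoop MapFile, whose index file then allows frequently accessed entries to be located without scanning the whole container. The class name, index interval, and directory layout are illustrative assumptions, not the paper's implementation.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileIndexer {
    // contents: small-file name -> bytes, e.g. the files selected as most frequently accessed.
    public static void buildMapFile(Configuration conf, String mapFileDir,
                                    Map<String, byte[]> contents) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // MapFile requires keys to be appended in sorted order; TreeMap guarantees that.
        TreeMap<String, byte[]> sorted = new TreeMap<>(contents);

        MapFile.Writer writer =
                new MapFile.Writer(conf, fs, mapFileDir, Text.class, BytesWritable.class);
        writer.setIndexInterval(1);   // assumed setting: index every entry for the "hot" files
        try {
            for (Map.Entry<String, byte[]> e : sorted.entrySet()) {
                writer.append(new Text(e.getKey()), new BytesWritable(e.getValue()));
            }
        } finally {
            writer.close();           // writes both the data file and the index file
        }
    }
}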
Cloud computing is a model that allows on-demand access to information resources; it can be considered a distinctive technology for satisfying needs and responding to customer demands while guaranteeing an efficient quality of IT service management. This paper gives an analysis of cloud computing and its components, as well as the internal and external institutional factors that influence the adoption of cloud computing in most enterprises. It aims to be innovative by answering questions of allocation and availability of network services in an intra-datacenter environment, by implementing solutions based on a new technology, OpenStack [1]. The research problem addressed in this paper is the proposal of a new architecture and optimized algorithms for continuity and availability of service. These help identify vulnerabilities in existing deployments while offering a very high level of availability, security and quality of service.