In recent years, the rapid expansion of data such as text, images, audio, video, data-center content, and backups has created many problems in both storage and retrieval. Companies spend large sums to store this data, so an efficient technique for handling enormous data sets has become necessary. In this paper, we propose a new de-duplication technique for the contents of big data sets. Divisors are selected automatically using field separators; different dictionary indexing methods will be used to de-duplicate field contents that have bounded variability, and a set of computationally low-cost hash functions will be used to speed up de-duplication of fields consisting of long strings. The number, nature, and length of the fields will be tested. In addition, specific indexing and clustering methods will be applied to determine the optimal way to reduce the data size before de-duplication.
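To make the approach concrete, the following is a minimal sketch of field-wise de-duplication under stated assumptions: FIELD_SEP, DICT_FIELDS, and LONG_FIELDS are hypothetical placeholders (the paper derives divisors automatically from field separators), and CRC32 merely stands in for whatever low-cost hash family is actually used. It is illustrative, not the authors' implementation.

```python
import zlib

FIELD_SEP = ","        # assumed separator; the paper selects divisors automatically
DICT_FIELDS = {0, 1}   # hypothetical positions of bounded-variability fields
LONG_FIELDS = {2}      # hypothetical positions of long-string fields

def dedupe(records):
    """De-duplicate records field by field.

    Bounded-variability fields are replaced by small integer codes drawn
    from a shared dictionary; long-string fields are fingerprinted with a
    cheap hash (CRC32 here) so duplicate detection avoids full string
    comparisons on large values.
    """
    dictionary = {}   # field value -> small integer code
    seen = set()      # fingerprints of records already kept
    unique = []
    for rec in records:
        fields = rec.split(FIELD_SEP)
        key = []
        for i, f in enumerate(fields):
            if i in DICT_FIELDS:
                # dictionary indexing: identical values share one code
                key.append(dictionary.setdefault(f, len(dictionary)))
            elif i in LONG_FIELDS:
                # computationally low-cost hash for long strings
                key.append(zlib.crc32(f.encode()))
            else:
                key.append(f)
        fingerprint = tuple(key)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(rec)
    return unique

if __name__ == "__main__":
    rows = [
        "NY,red," + "a" * 200,
        "NY,red," + "a" * 200,   # exact duplicate, dropped
        "LA,blue," + "b" * 200,
    ]
    print(len(dedupe(rows)))    # prints 2
```

Note that a non-cryptographic hash such as CRC32 can collide, so a production system would verify candidate duplicates byte for byte; the sketch omits that check for brevity.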