Users are often interested in a specific type of data (user-preferred data) within a large-volume dataset. An efficient system that stores only the user-preferred data from the large dataset can reduce search latency, allowing users to find relevant information in a timely manner. The motivation behind this thesis is to devise a technique that filters a large dataset and stores only the filtered data, thereby saving storage space for the user. The filtering operation can be CPU-intensive, which can lead to high latency in extracting the preferred data from the dataset. To address this problem, the technique employs parallel processing and machine learning. A proof-of-concept prototype of this technique has been built on Apache Spark, and its performance is evaluated on synthetic datasets. Analysis of the experimental results shows the viability of the technique and provides insights into system behavior and performance.
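To make the filtering idea concrete, the following is a minimal sketch, not the thesis prototype itself, of how a preference filter could be applied to a large dataset with Apache Spark. The input and output paths, the column name, and the simple predicate standing in for a learned preference model are illustrative assumptions.

    # Minimal sketch; paths, column name, and predicate are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("preference-filter").getOrCreate()

    # Read the large dataset; Spark parallelizes the scan across partitions.
    records = spark.read.parquet("hdfs:///data/large_dataset")  # assumed input path

    # Stand-in predicate for "user-preferred" records; the actual technique
    # would instead score records with a trained machine-learning model.
    preferred = records.filter(F.col("category") == "user_preferred")  # assumed column

    # Persist only the filtered subset, saving storage space for the user.
    preferred.write.mode("overwrite").parquet("hdfs:///data/preferred_subset")  # assumed output path

    spark.stop()

Because the filter runs as a distributed Spark job, the CPU-intensive filtering work is spread across executors rather than performed on a single machine, which is the parallel-processing aspect the abstract refers to.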