Pattern matching is essential at many stages of the computational process. It lets users search a database for specific DNA sequences or subsequences. Some of these biological databases are expanding rapidly and are updated regularly, and as biological data grows exponentially, researchers are working to improve solutions in many areas of computational bioinformatics. Real-world applications need faster algorithms with a low error rate, and high-speed pattern matching algorithms can improve pattern searches. This study therefore presents two pattern matching algorithms, EFLPM and EPAPM, designed to speed up pattern searches in DNA sequences. The proposed algorithms improve performance by using word-level processing rather than the character-level processing used in previous studies. In terms of time cost, EFLPM and EPAPM improved performance by leveraging word-level processing with large pattern sizes. The experimental results show that the proposed methods are faster than other algorithms for both short and long patterns: EFLPM is 54% faster than FLPM, and EPAPM is 39% faster than PAPM.
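The difference between character-level and word-level comparison can be illustrated with a minimal sketch. It is not the EFLPM or EPAPM algorithm, whose details are not given in the abstract. It assumes a hypothetical 2-bit encoding of the four bases (the `CODE` table) and packs each text window into a single integer, so that a whole window is compared with the pattern in one integer comparison instead of base by base. The names `pack` and `find_word_level` are illustrative only.

```python
# Hypothetical 2-bit encoding of DNA bases (an assumption for this sketch,
# not taken from the EFLPM/EPAPM papers).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def find_word_level(text, pattern):
    """Return every start position of pattern in text.

    Each window of len(pattern) bases is held as one packed integer and
    compared with the packed pattern in a single comparison.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    mask = (1 << (2 * m)) - 1      # keeps only the last m bases
    target = pack(pattern)
    window = pack(text[:m])        # first window
    hits = [0] if window == target else []
    for i in range(m, n):
        # Slide the window by one base: shift in the new base, drop the oldest.
        window = ((window << 2) | CODE[text[i]]) & mask
        if window == target:
            hits.append(i - m + 1)
    return hits


print(find_word_level("ACGTACGT", "CGT"))  # [1, 5]
```

Because the 2-bit encoding is exact for windows of equal length, equal integers imply equal windows, so no verification pass is needed. In a language with fixed-width machine words, the same idea would compare one machine word (for example 32 bases in 64 bits) at a time.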
Summary: Pattern matching is a highly useful procedure in several stages of computational pipelines. Research trends in this domain have also contributed to growing biological databases that are updated over time. This article presents a comparison and analysis of different pattern matching algorithms in terms of complexity, efficiency, and techniques, and asks which algorithm is best for which DNA sequence and why. It describes the algorithms used in various activities in which pattern matching is an important part of the functionality. The article shows that the BM, Horspool, ZT, QS, FS, Smith, and SSABS methods employ the bad character preprocessing function. In addition, the BM, SSABS, TVSBS, and BRFS methods use two approaches in the preprocessing stage, which decreases preprocessing time. KR, QS, SSABS, BRFS, and Shift‐Or are not recommended for long patterns, whereas ZT, FS, d‐BM, Raita, and Smith are not recommended for short patterns, because they are time‐consuming in those cases. Certain algorithms, such as ZT and DCPM, use a lot of time and space during the matching and search process, while others, such as d‐BM and TSW, save space and time. Although DCPM, BRFS, and QS are quicker than other algorithms, FLPM, PAPM, and LFPM rank highest in terms of time complexity.
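The bad character preprocessing function mentioned above can be sketched with the Horspool variant, one of the methods the summary lists. This is a minimal illustration under the usual textbook formulation, not code from the reviewed article. The function names `bad_character_table` and `horspool_search` are chosen here for illustration.

```python
def bad_character_table(pattern):
    """Map each character (except the last one) to its shift distance.

    The shift is the distance from the character's rightmost occurrence
    in pattern[:-1] to the end of the pattern.
    """
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return every start position of pattern in text (Horspool sketch)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    shift = bad_character_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the table entry for the text character aligned with the
        # last pattern position; characters not in the table shift by m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(bad_character_table("TTAC"))                  # {'T': 2, 'A': 1}
print(horspool_search("GATTACAGATTACA", "TTAC"))    # [2, 9]
```

With a four-letter DNA alphabet the table is tiny, but the shifts are also shorter on average than for natural-language text, which is one reason the relative ranking of these algorithms depends on pattern length.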
The study proposes a novel model for DNA sequence classification that combines machine learning methods with a pattern-matching algorithm. The model aims to categorize DNA sequences by their features and to improve the accuracy and efficiency of classification. Its performance is evaluated with various machine learning algorithms. Among those tested, the SVM linear classifier achieves the highest accuracy (0.963) and F1 score (0.97), indicating that it can provide the best overall performance in DNA sequence classification. Naive Bayes also performs well, with an accuracy of 0.838 and an F1 score of 0.94. The proposed model is also compared with two suggested algorithms, FLPM and PAPM, and outperforms them in accuracy and efficiency. The study further explores the impact of pattern length on the accuracy and time complexity of each algorithm; execution time varies as the pattern length increases. For a pattern length of 5, SVM Linear and EFLPM have the lowest execution time, 0.0035 s, while at a pattern length of 25, SVM Linear has the lowest, 0.0012 s. The proposed model contributes to DNA sequence analysis by providing a novel approach to pre-processing and feature extraction. Its potential applications include drug discovery, personalized medicine, and disease diagnosis.
The study’s findings highlight the importance of considering the impact of pattern length on the accuracy and time complexity of DNA sequence classification algorithms.
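For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and F1 score are computed from true and predicted labels. It assumes a binary labelling with a single positive class, which the abstract does not state. The labels in the usage example are made up for illustration and are not data from the study.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 score for a binary labelling.

    Assumes one positive class; a multi-class study would need an
    averaging scheme, which the abstract does not specify.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Illustrative labels only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(accuracy_and_f1(y_true, y_pred))  # (0.75, 0.8)
```

The two metrics can diverge, as in the Naive Bayes figures above, because F1 ignores true negatives while accuracy counts them.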