A new fast technique for pattern matching in biological sequences

Ibrahim, Osman Ali Sadek; Hamed, Belal A.; El‐Hafeez, Tarek Abd

doi:10.1007/s11227-022-04673-3

Cited by 17 publications

(7 citation statements)

References 31 publications

“…Table 4 provides a comparison of the machine learning techniques discussed above with traditional techniques, including our two proposed methods (EFLPM and EPAPM) [47] based on their execution time for different pattern lengths in DNA sequences.…”

Section: The Experimental Results (mentioning)

confidence: 99%

“…We opted to build databases on DNA to examine the machine learning algorithms discussed in the following paragraphs. The rationale behind this decision was to sample some of the genes that we had worked on in our previous research endeavors [15]. Our objective was to integrate automated learning algorithms and pattern-matching algorithms that are based on specific DNA sequences, in order to create a biological data collection that could be utilized in a classification process. We conducted experiments on a dataset that included DNA sequences, where we compared the effectiveness of searching for a specific pattern with other classification models, such as Random Forest [3,16], KNN [16–20], Naïve Bayes [21–24], Decision tree [23, 25–30], and Support Vector Machine [18, 31–36] with Linear [37,38], RBF [37,39], and sigmoid [21,40] classifiers, the results of these classifiers models are calculated by F1 score, recall, precision rate, execution time, and with the accuracy which calculates the most effective pattern-matching classifier.…”

Section: Methodology for PM from DNA Sequences (mentioning)

confidence: 99%

Optimizing classification efficiency with machine learning techniques for pattern matching

Hamed, Ibrahim, El‐Hafeez

2023

J Big Data

Self Cite

The study proposes a novel model for DNA sequence classification that combines machine learning methods and a pattern-matching algorithm. This model aims to effectively categorize DNA sequences based on their features and enhance the accuracy and efficiency of DNA sequence classification. The performance of the proposed model is evaluated using various machine learning algorithms, and the results indicate that the SVM linear classifier achieves the highest accuracy and F1 score among the tested algorithms. This finding suggests that the proposed model can provide better overall performance than other algorithms in DNA sequence classification. In addition, the proposed model is compared to two suggested algorithms, namely FLPM and PAPM, and the results show that the proposed model outperforms these algorithms in terms of accuracy and efficiency. The study further explores the impact of pattern length on the accuracy and time complexity of each algorithm. The results show that as the pattern length increases, the execution time of each algorithm varies. For a pattern length of 5, SVM Linear and EFLPM have the lowest execution time of 0.0035 s. However, at a pattern length of 25, SVM Linear has the lowest execution time of 0.0012 s. The experimental results of the proposed model show that SVM Linear has the highest accuracy and F1 score among the tested algorithms. SVM Linear achieved an accuracy of 0.963 and an F1 score of 0.97, indicating that it can provide the best overall performance in DNA sequence classification. Naive Bayes also performs well with an accuracy of 0.838 and an F1 score of 0.94. The proposed model offers a valuable contribution to the field of DNA sequence analysis by providing a novel approach to pre-processing and feature extraction. The model’s potential applications include drug discovery, personalized medicine, and disease diagnosis. 
The study’s findings highlight the importance of considering the impact of pattern length on the accuracy and time complexity of DNA sequence classification algorithms.

“…//Create substring based on pattern length
for (i = 0 to m) do
    sum ⟵ ASCII(s_i)
end for loop
//Create hash value using predefined prime number
h(s) ⟵ sum mod q
//Create quotient value using predefined prime number
r(s) ⟵ sum divide q
if h(p) = h(s) and r(p) = r(s)…”

Section: Searching Phase (mentioning)

confidence: 99%
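The quoted searching phase compares two values per window: the remainder and the quotient of the window's ASCII sum with respect to a prime q. The Python sketch below illustrates that idea only; it is not the cited authors' implementation. The function names, the default prime 101, the incremental update of the window sum, and the final character-by-character verification are all assumptions added for illustration. Since sum = r · q + h, matching both values means the two ASCII sums are equal, so windows that are permutations of the pattern still pass the filter and need the explicit check.

```python
def hash_and_quotient(s: str, q: int) -> tuple[int, int]:
    """Return (sum mod q, sum div q) for the ASCII sum of s."""
    total = sum(ord(ch) for ch in s)
    return total % q, total // q


def find_matches(text: str, pattern: str, q: int = 101) -> list[int]:
    """Return start indices of exact matches of pattern in text.

    q is a hypothetical choice of prime; the excerpt does not state one.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    hp, rp = hash_and_quotient(pattern, q)
    # ASCII sum of the first window; later windows are updated
    # incrementally (an assumption, the excerpt recomputes the sum).
    total = sum(ord(ch) for ch in text[:m])
    hits = []
    for i in range(n - m + 1):
        if i > 0:
            total += ord(text[i + m - 1]) - ord(text[i - 1])
        # Filter on hash and quotient, then verify to rule out
        # windows with the same ASCII sum but different characters.
        if (total % q, total // q) == (hp, rp) and text[i:i + m] == pattern:
            hits.append(i)
    return hits


print(find_matches("ACGTACGTAC", "GTA"))
```

In "TAGGTA", the window "TAG" has the same ASCII sum as "GTA" and therefore the same hash and quotient, which is why the sketch keeps the verification step.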

“…To comprehend biological data, mainly when the datasets are enormous and complicated, the interdisciplinary discipline of bioinformatics develops techniques and software tools [5]. Pattern matching issues appear in many computational bioinformatics tasks, including basic local synchronization search, biomarker discovery, sequence matching, homologous sequence identification, and proteogenomic mapping [6,7]. Pattern matching can be used in biotechnology, forensics, medical, and agricultural research to look into probable disease or anomaly diagnoses [8].…”

Section: Introduction (mentioning)

confidence: 99%

An Improved Hashing Approach for Biological Sequence to Solve Exact Pattern Matching Problems

Mahmud, Rahman, Hasan Talukder

2023

Applied Computational Intelligence and Soft Computing

Pattern matching algorithms have gained a lot of importance in computer science, primarily because they are used in various domains such as computational biology, video retrieval, intrusion detection systems, and fraud detection. Finding one or more patterns in a given text is known as pattern matching. Two important things that are used to judge how well exact pattern matching algorithms work are the total number of attempts and the character comparisons that are made during the matching process. The primary focus of our proposed method is reducing the size of both components wherever possible. Despite sprinting, hash-based pattern matching algorithms may have hash collisions. The Efficient Hashing Method (EHM) algorithm is improved in this research. Despite the EHM algorithm’s effectiveness, it takes a lot of time in the preprocessing phase, and some hash collisions are generated. A novel hashing method has been proposed, which has reduced the preprocessing time and hash collision of the EHM algorithm. We devised the Hashing Approach for Pattern Matching (HAPM) algorithm by taking the best parts of the EHM and Quick Search (QS) algorithms and adding a way to avoid hash collisions. The preprocessing step of this algorithm combines the bad character table from the QS algorithm, the hashing strategy from the EHM algorithm, and the collision-reducing mechanism. To analyze the performance of our HAPM algorithm, we have used three types of datasets: E. coli, DNA sequences, and protein sequences. We looked at six algorithms discussed in the literature and compared our proposed method. The Hash-q with Unique FNG (HqUF) algorithm was only compared with E. coli and DNA datasets because it creates unique bits for DNA sequences. Our proposed HAPM algorithm also overcomes the problems of the HqUF algorithm. 
The new method beats older ones regarding average runtime, number of attempts, and character comparisons for long and short text patterns, though it did worse on some short patterns.
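The abstract judges exact matchers by the number of attempts and character comparisons, and it names the bad character table of the Quick Search (QS) algorithm as one ingredient of HAPM. The Python sketch below shows plain Quick Search with both counters, as an illustration of those two metrics. It is not HAPM and omits the EHM hashing and the collision-reducing mechanism; the function name and the returned tuple layout are assumptions.

```python
def quick_search(text: str, pattern: str) -> tuple[list[int], int, int]:
    """Quick Search (Sunday) exact matching.

    Returns (match positions, attempts, character comparisons).
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0, 0
    # Bad character table: shift for the character just past the window.
    # Later (rightmost) occurrences overwrite earlier ones.
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    hits, attempts, comparisons = [], 0, 0
    j = 0
    while j <= n - m:
        attempts += 1
        k = 0
        while k < m:
            comparisons += 1
            if text[j + k] != pattern[k]:
                break
            k += 1
        if k == m:
            hits.append(j)
        if j + m >= n:
            break  # no character past the window to look up
        # Characters absent from the pattern allow a shift of m + 1.
        j += shift.get(text[j + m], m + 1)
    return hits, attempts, comparisons


hits, attempts, comparisons = quick_search("ACGTACGTAC", "GTA")
print(hits, attempts, comparisons)
```

A naive left-to-right scan would make eight attempts on this ten-character text; the shift table lets Quick Search skip most of them.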

“…Computational methods, on the other hand, are efficient and effective, and they play an important role in many areas of bioinformatics. For example, in silico techniques are being used rapidly in research on disease-gene interactions [25,26], protein structure prediction [27], peptide therapeutic function, gene editing experiments [28], meaningful pattern detection [29], and drug repurposing [30,31]. Previously, researchers have been proposed a few computational models for predicting the 2-OM sites based on single machine learning (ML) and deep learning (DL) approaches [32][33][34][35][36][37][38][39].…”

Section: Introduction (mentioning)

confidence: 99%

Untitled
