Genomic signal processing for DNA sequence clustering

Mendizabal-Ruiz, Gerardo; Román-Godínez, Israel; Torres-Ramos, Sulema; Salido-Ruiz, Ricardo A.; Vélez-Pérez, Hugo; Morales, J. Alejandro

doi:10.7717/peerj.4264

Cited by 30 publications

(31 citation statements)

References 45 publications


“…He chose Euclidean distance as the similarity measure to be adopted by the K-means algorithm. This method can be used to evaluate the ability of markers or genes to distinguish organisms at different levels, identify subgroups in a group of organisms, and classify fragments of DNA sequences based on known sequences (Mendizabal-Ruiz et al, 2018). Mendizabal-Ruiz G has demonstrated that it is possible to group DNA sequences based on their frequency components.…”

Section: DNA Sequence Clustering (mentioning)

confidence: 99%

Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA

Yang

Zhang

Wang

et al. 2020

Front. Bioeng. Biotechnol.

108


Deoxyribonucleic acid (DNA) is a biological macromolecule. Its main function is information storage. At present, the advancement of sequencing technology had caused DNA sequence data to grow at an explosive rate, which has also pushed the study of DNA sequences in the wave of big data. Moreover, machine learning is a powerful technique for analyzing largescale data and learns spontaneously to gain knowledge. It has been widely used in DNA sequence data analysis and obtained a lot of research achievements. Firstly, the review introduces the development process of sequencing technology, expounds on the concept of DNA sequence data structure and sequence similarity. Then we analyze the basic process of data mining, summary several major machine learning algorithms, and put forward the challenges faced by machine learning algorithms in the mining of biological sequence data and possible solutions in the future. Then we review four typical applications of machine learning in DNA sequence data: DNA sequence alignment, DNA sequence classification, DNA sequence clustering, and DNA pattern mining. We analyze their corresponding biological application background and significance, and systematically summarized the development and potential problems in the field of DNA sequence data mining in recent years. Finally, we summarize the content of the review and look into the future of some research directions for the next step.
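The citation statement for this entry describes grouping DNA sequences by their frequency components with K-means and Euclidean distance. A minimal sketch of that idea follows, under several assumptions that are not taken from the cited paper: sequences have equal length, each base is turned into a binary indicator signal, the feature vector is the summed DFT power of the four signals, and the centroids are initialised with a deterministic farthest-first rule. All function names are hypothetical.

```python
import cmath
import math

BASES = "ACGT"


def power_spectrum(seq):
    """Summed DFT power of the four base-indicator signals, for k = 1..n//2."""
    seq = seq.upper()
    n = len(seq)
    spectrum = []
    for k in range(1, n // 2 + 1):  # k = 0 is skipped: it only reflects base counts
        total = 0.0
        for base in BASES:
            coeff = sum(
                cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(seq)
                if s == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total / n)
    return spectrum


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centroids(points, k):
    """Farthest-first initialisation (an assumption, chosen for reproducibility)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=20):
    """Plain K-means with Euclidean distance; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_sequences(seqs, k):
    """Cluster equal-length DNA sequences by their power spectra."""
    return kmeans([power_spectrum(s) for s in seqs], k)
```

For example, `cluster_sequences(["ACG" * 8, "TCG" + "ACG" * 7, "AT" * 12, "GT" + "AT" * 11], k=2)` should place the two period-3 sequences in one cluster and the two period-2 sequences in the other, since their spectral peaks fall at different frequencies.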


“…In this work, a similar algorithm is implemented to analyse nucleotide sequences: each nucleotide position in a sequence is represented as a four elements vector, the Voss representation [24], encoding the probability of each base according to previously aligned reads. This numerical representation of DNA sequence is appropriate for the comparison of DNA sequences [25] and their classification[26]. In molecular biology, a similar algorithm has been applied to the clustering of amino acid sequences [27] where vector quantization is used to estimate the probability density of amino acids.…”

Section: Methods (mentioning)

confidence: 99%

Effective machine-learning assembly for next-generation amplicon sequencing with very low coverage

2019


Background: In short-read DNA sequencing experiments, the read coverage is a key parameter to successfully assemble the reads and reconstruct the sequence of the input DNA. When coverage is very low, the original sequence reconstruction from the reads can be difficult because of the occurrence of uncovered gaps. Reference guided assembly can then improve these assemblies. However, when the available reference is phylogenetically distant from the sequencing reads, the mapping rate of the reads can be extremely low. Some recent improvements in read mapping approaches aim at modifying the reference according to the reads dynamically. Such approaches can significantly improve the alignment rate of the reads onto distant references but the processing of insertions and deletions remains challenging. Results: Here, we introduce a new algorithm to update the reference sequence according to previously aligned reads. Substitutions, insertions and deletions are performed in the reference sequence dynamically. We evaluate this approach to assemble a western-grey kangaroo mitochondrial amplicon. Our results show that more reads can be aligned and that this method produces assemblies of length comparable to the truth while limiting error rate when classic approaches fail to recover the correct length. Finally, we discuss how the core algorithm of this method could be improved and combined with other approaches to analyse larger genomic sequences. Conclusions: We introduced an algorithm to perform dynamic alignment of reads on a distant reference. We showed that such approach can improve the reconstruction of an amplicon compared to classically used bioinformatic pipelines. Although not portable to genomic scale in the current form, we suggested several improvements to be investigated to make this method more flexible and allow dynamic alignment to be used for large genome assemblies.
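The citation statement for this entry mentions the Voss representation, in which each nucleotide position becomes a four-element vector. A minimal sketch follows, assuming the base order A, C, G, T and a hypothetical helper that turns the bases observed at one aligned position into relative frequencies. It illustrates the general encoding only and is not the citing paper's implementation.

```python
BASES = "ACGT"  # assumed ordering of the four vector elements


def voss_encode(seq):
    """Binary indicator (one-hot) vector per position; unknown symbols map to all zeros."""
    return [[1.0 if s == b else 0.0 for b in BASES] for s in seq.upper()]


def base_probabilities(column):
    """Relative frequency of each base among the reads covering one position.

    `column` is a string of the bases observed at that position (hypothetical input format).
    """
    observed = [s for s in column.upper() if s in BASES]
    if not observed:
        # Assumption: fall back to a uniform vector when no read covers the position.
        return [0.25] * len(BASES)
    return [observed.count(b) / len(observed) for b in BASES]
```

Here `voss_encode("ACGT")` yields the four unit vectors in order, and `base_probabilities("AAAC")` yields `[0.75, 0.25, 0.0, 0.0]`.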


“…The results showed that the FTIR sampling techniques had a significant influence on the spectral characteristics, spectral quality, and sampling efficiency. Ruiz et al [32] proposed a novel approach for performing cluster analysis of DNA sequences that is based on the use of Genomic signal processing GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the easy inspection and analysis of the results and possible hidden behaviors.…”

Section: DNA Sequence (mentioning)

confidence: 99%

Discriminant Analysis for the Eigenvalues of Variance Covariance Matrix of FFT Scaling of DNA Sequences: An Empirical Study of Some Organisms

Abid

Farhood

2019

IJIIS


Many studies discussed different numerical representations of DNA sequences. One naive approach for exploring the nature of a DNA sequence is to assign numerical values (or scales) to the nucleotides and then proceed with standard time series methods. The analysis will depend actually on the particular assignment of numerical values. Discriminant analysis aims to examine the dependence of one qualitative (classification) variable from several quantitative variables according to number of variations of qualitative variable we can distinction. Actually, there is a discriminant analysis for two or more groups. The essential work of discriminant analysis is to get the optimal assigning rules that will minimize the likelihood of incorrect classification of elements. In this paper, we discussed the discriminant analysis of the first, second, third and fourth eigenvalues of variance covariance matrix of Fast Fourier Transform (FFT) for numerical values representation of DNA sequences of five organisms, Human, E. coli, Rat, Wheat and Grasshopper. The analysis is based on three methods (All Variables, Forward Selection and Backward Selection) of discrimination. Functions have been reached whereby discrimination is made among organisms under consideration. Empirical studies are conducted to show the value of our point of view and the applications based on. Therefore, we recommended that, other empirical studies should be done for other organisms and statistical methods by using the point of view adopted here. Also, aspects stated here must be used in an applied manner for DNA sequences discrimination.
