In gene expression analysis, the expression levels of thousands of genes are analyzed across conditions such as separate stages of treatments or diseases. Identifying a particular gene sequence pattern is a challenging task with respect to performance. The proposed solution addresses the performance issues in genomic stream matching, which involves assembly and sequencing. When counting k-mers for a given input value of k and performing DNA sequencing tasks, researchers need to concentrate on sequence matching. The proposed solution addresses performance metrics such as processing time for k-mer counting, number of operations for similarity matching, memory utilization during similarity search, and processing time for stream matching. It suggests an improved algorithm, Revised Rabin Karp (RRK), for the basic operation and, to achieve more efficiency, a novel framework based on Hadoop MapReduce blended with Pig and Apache Tez. Measured by memory utilization and processing time, the proposed model proves its efficiency when compared to existing approaches.
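For orientation, the two basic operations named in the abstract above, k-mer counting and pattern matching, can be sketched in a few lines of Python. This is the classic Rabin-Karp rolling hash, not the authors' Revised Rabin Karp (RRK), whose details the abstract does not give. The function names, the base-4 nucleotide encoding, and the modulus are assumptions made for illustration.

```python
from collections import Counter

# Assumed encoding: one base-4 digit per nucleotide (not taken from the paper).
NUCLEOTIDE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rabin_karp(text, pattern, base=4, mod=1_000_003):
    """Return the start index of each occurrence of pattern in text.

    Classic Rabin-Karp: compare rolling hashes first, then confirm
    with a direct string comparison to rule out hash collisions.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the outgoing character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + NUCLEOTIDE_CODE[pattern[i]]) % mod
        t_hash = (t_hash * base + NUCLEOTIDE_CODE[text[i]]) % mod
    hits = []
    for i in range(n - m + 1):
        if t_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Slide the window: drop text[i], append text[i + m].
            outgoing = NUCLEOTIDE_CODE[text[i]] * high
            t_hash = ((t_hash - outgoing) * base + NUCLEOTIDE_CODE[text[i + m]]) % mod
    return hits
```

The rolling update lets each window hash be derived from the previous one in constant time, which is the property that makes Rabin-Karp attractive for long genomic streams.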
Introduction: The primary structure of a protein is a polypeptide chain made up of a sequence of amino acids. Interactions between the atoms of the backbone produce folded structures within the polypeptide, which constitute the secondary structure. Sequence alignments can be made more accurate by the inclusion of secondary structure information. Objective: It is difficult to identify the sequence information embedded in the secondary structure of the protein. However, deep learning methods can be used to identify the sequence information in protein structures. Methods: The scope of the proposed work is to increase the accuracy of identifying the sequence information in the primary structure and the tertiary structure, thereby increasing the accuracy of the predicted protein secondary structure (PSS). In this proposed work, homology is eliminated by a Recurrent Neural Network (RNN) based network that consists of three layers, namely a bi-directional Long Short-Term Memory (LSTM) layer, a time-distributed layer, and a Softmax layer. Results: The proposed LDS model achieves an accuracy of approximately 86% for the prediction of the three-state secondary structure of the protein. Conclusion: The gap between the number of known protein primary structures and known secondary structures is huge and increasing. Machine learning is trying to reduce this gap. In most previous attempts at predicting the secondary structure of proteins, the data is divided according to the homology of the proteins. This limits the efficiency of the predicting model and limits the inputs that can be given to such models. Hence, in our model, homology has not been considered while collecting the data for training or testing our model. As a result, our model is not affected by the homology of the protein fed to it, which removes that restriction, so any protein can be fed to it.
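As a rough illustration of the final stage described above, a softmax output per residue can be decoded into one of three states and scored by per-residue accuracy (often called Q3). This sketch does not reproduce the LDS model or its LSTM layers. The state labels H, E, C and all function names are assumptions, since the abstract only says "three-state".

```python
import math

# Assumed three-state alphabet: helix, strand, coil (not stated in the abstract).
STATES = "HEC"


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(per_residue_scores):
    """Pick the most probable state for each residue."""
    labels = []
    for scores in per_residue_scores:
        probs = softmax(scores)
        labels.append(STATES[probs.index(max(probs))])
    return "".join(labels)


def q3_accuracy(predicted, actual):
    """Fraction of residues whose predicted state matches the reference."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    if not actual:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)
```

Under this reading, the reported figure of approximately 86% would correspond to `q3_accuracy` averaged over the test proteins, although the abstract does not state how the average was taken.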
A key step in addressing the classification problem is gene selection, which removes redundant and irrelevant genes. The proposed Type Combination Approach – Feature Selection (TCA-FS) model uses efficient feature selection methods so that classification accuracy can be enhanced. Three classifiers, K Nearest Neighbour (KNN), Support Vector Machine (SVM), and Random Forest (RF), are selected for evaluating the chosen feature selection methods and the prediction accuracy. The effects of three new approaches for feature selection, Improved Recursive Feature Elimination (IRFE), Revised Maximum Information Coefficient (RMIC), and Upgraded Masked Painter (UMP), are analysed. These three proposed techniques are compared with existing techniques and validated by (i) a stability determination test, (ii) classification accuracy, and (iii) error rates. Due to the selection of a proper threshold on classification, the proposed TCA-FS method provides higher accuracy compared to the existing system.
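The abstract does not say how its stability determination test is computed. One common way to quantify the stability of a feature selection method is the mean pairwise Jaccard similarity between the gene subsets chosen on different resamples of the data. The sketch below assumes that measure, and the function names and gene identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two gene subsets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(subsets):
    """Mean pairwise Jaccard similarity across repeated selection runs.

    A value of 1.0 means every run selected the same genes; values
    near 0.0 mean the selected genes change from run to run.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene identifiers selected in three resampling runs.
runs = [{"g1", "g2"}, {"g2", "g3"}, {"g1", "g2"}]
stability = selection_stability(runs)
```

A selector with high accuracy but low stability picks different genes on each resample, which makes the selected genes harder to interpret biologically.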