Next word prediction is a helpful feature in many typing subsystems: suggestions offered while typing speed up the writing of digital documents. Researchers have therefore long tried to improve the capability of such prediction systems. Knowledge of the inner meaning of words, together with a contextual understanding of the sequence, can enhance next word prediction. With the advancement of Natural Language Processing (NLP), these ideas have proven applicable in real scenarios: word embeddings capture various relations among words and encode their inner meaning, while sequence modeling captures contextual information. In this paper, we investigate which embedding method works best for Bengali next word prediction. We compare word2vec skip-gram, word2vec CBOW, fastText skip-gram, and fastText CBOW, applying each in an LSTM-based deep learning sequential model trained on a large corpus of Bengali text. The results reveal useful insights about contextual and sequential information gathering that will help in implementing a context-based Bengali next word prediction system.
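The core difference between the skip-gram and CBOW objectives compared above lies in how training examples are formed from a token sequence: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context window. A minimal sketch of that pair construction (independent of the paper's actual training pipeline; the function names and window size are illustrative assumptions):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, context) word pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: the full context window jointly predicts the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tuple(tokens[j]
                        for j in range(max(0, i - window),
                                       min(len(tokens), i + window + 1))
                        if j != i)
        if context:
            pairs.append((context, center))
    return pairs
```

For the same corpus, skip-gram therefore yields more (and noisier) examples per sentence, which is one reason the two objectives can rank differently in a downstream task such as next word prediction.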
DUJASE Vol. 7 (2) 8-15, 2022 (July)
Feature selection methods are used as a preliminary step in different areas of machine learning. Feature selection usually involves ranking the features or extracting a subset of features from the original dataset. Among the various types of feature selection methods, distance-based methods are popular for their simplicity and accuracy. Moreover, they can capture the interaction among the features for a particular application. However, it is difficult to decide from a ranked feature set which feature subset yields better accuracy. To solve this problem, in this paper we propose Relief based Feature Subset Selection (RFSS), a method that captures a more interactive and relevant feature subset for obtaining better accuracy. Experimental results on 16 benchmark datasets demonstrate that the proposed method performs better than the state-of-the-art methods.
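The distance-based ranking underlying RFSS is the classic Relief weighting scheme: for each sample, a feature is rewarded when it differs on the nearest sample of the opposite class (nearest miss) and penalized when it differs on the nearest sample of the same class (nearest hit). A minimal NumPy sketch of that base ranking step, not the proposed RFSS subset-selection method itself (binary labels and [0, 1]-scaled features assumed):

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief weights: higher means the feature better separates classes.
    X: (n, d) array with features scaled to [0, 1]; y: (n,) binary labels."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dists = np.abs(X - X[i]).sum(axis=1)  # L1 distance to every sample
        dists[i] = np.inf                     # never match a sample with itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest same-class
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest other-class
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n
    return w
```

Ranking features by descending weight (`np.argsort(-w)`) gives the ordered list from which a subset must then be chosen; picking the cut-off point in that list is exactly the difficulty the abstract describes.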
DUJASE Vol. 6 (2) 7-13, 2021 (July)
Mutual information (MI) based feature selection methods are gaining popularity due to their ability to capture both linear and nonlinear relationships among random variables, and they therefore perform well in different fields of machine learning. Traditional MI based feature selection algorithms use different techniques to find the joint performance of features and select the relevant features among them. However, in doing so, they may in many cases incorporate redundant features. To solve these issues, we propose a feature selection method, namely Clustering based Feature Selection (CbFS), which clusters the features in such a way that redundant and complementary features are grouped in the same cluster. Then, a subset of representative features is selected from each cluster. Experimental results for CbFS and four state-of-the-art methods are reported over twenty benchmark UCI datasets and three well-known network intrusion datasets to measure the superiority of CbFS. They show that CbFS outperforms the comparative methods in terms of accuracy and better identifies attack and normal instances in the security datasets.
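The idea of grouping redundant features can be illustrated with MI itself: two features that share high mutual information carry overlapping information, so one can represent the other. The sketch below is a hypothetical stand-in for CbFS, whose actual clustering procedure the abstract does not specify; the greedy threshold grouping, the `threshold` value, and discrete-valued features are all illustrative assumptions:

```python
from collections import Counter
from math import log2

def mutual_info(a, b):
    """MI between two discrete feature columns, in bits."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def cluster_features(columns, threshold=0.5):
    """Greedy grouping: a feature joins the first cluster whose representative
    shares at least `threshold` bits of MI with it, else starts a new cluster."""
    clusters = []
    for idx in range(len(columns)):
        for cl in clusters:
            if mutual_info(columns[cl[0]], columns[idx]) >= threshold:
                cl.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters
```

Selecting one representative per cluster (for example, the member with the highest MI against the class label) then yields a reduced subset in which redundant features appear only once.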
DUJASE Vol. 7 (2) 47-55, 2022 (July)