Malware detection plays a crucial role in computer security. Recent researches mainly use machine learning based methods heavily relying on domain knowledge for manually extracting malicious features. In this paper, we propose MalNet, a novel malware detection method that learns features automatically from the raw data. Concretely, we first generate a grayscale image from malware file, meanwhile extracting its opcode sequences with the decompilation tool IDA. Then MalNet uses CNN and LSTM networks to learn from grayscale image and opcode sequence, respectively, and takes a stacking ensemble for malware classification. We perform experiments on more than 40,000 samples including 20,650 benign files collected from online software providers and 21,736 malwares provided by Microsoft. The evaluation result shows that MalNet achieves 99.88% validation accuracy for malware detection. In addition, we also take malware family classification experiment on 9 malware families to compare MalNet with other related works, in which MalNet outperforms most of related works with 99.36% detection accuracy and achieves a considerable speed-up on detecting efficiency comparing with two state-of-the-art results on Microsoft malware dataset.
Understanding the interactions between proteins and ligands is critical for protein function annotations and drug discovery. We report a new sequence-based template-free predictor (TargetATPsite) to identify the Adenosine-5'-triphosphate (ATP) binding sites with machine-learning approaches. Two steps are implemented in TargetATPsite: binding residues and pockets predictions, respectively. To predict the binding residues, a novel image sparse representation technique is proposed to encode residue evolution information treated as the input features. An ensemble classifier constructed based on support vector machines (SVM) from multiple random under-samplings is used as the prediction model, which is effective for dealing with imbalance phenomenon between the positive and negative training samples. Compared with the existing ATP-specific sequence-based predictors, TargetATPsite is featured by the second step of possessing the capability of further identifying the binding pockets from the predicted binding residues through a spatial clustering algorithm. Experimental results on three benchmark datasets demonstrate the efficacy of TargetATPsite.
Bag-of-words (BOW) is now the most popular way to model text in statistical machine learning approaches in sentiment analysis. However, the performance of BOW sometimes remains limited due to some fundamental deficiencies in handling the polarity shift problem. We propose a model called dual sentiment analysis (DSA), to address this problem for sentiment classification. We first propose a novel data expansion technique by creating a sentiment-reversed review for each training and test review. On this basis, we propose a dual training algorithm to make use of original and reversed training reviews in pairs for learning a sentiment classifier, and a dual prediction algorithm to classify the test reviews by considering two sides of one review. We also extend the DSA framework from polarity (positive-negative) classification to 3-class (positivenegative-neutral) classification, by taking the neutral reviews into consideration. Finally, we develop a corpus-based method to construct a pseudo-antonym dictionary, which removes DSA's dependency on an external antonym dictionary for review reversion. We conduct a wide range of experiments including two tasks, nine datasets, two antonym dictionaries, three classification algorithms and two types of features. The results demonstrate the effectiveness of DSA in addressing polarity shift in sentiment classification.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.