As a crime of employing technical means to steal sensitive information of users, phishing is currently a critical threat facing the Internet, and losses due to phishing are growing steadily. Feature engineering is important in phishing website detection solutions, but the accuracy of detection critically depends on prior knowledge of features. Moreover, although features extracted from different dimensions are more comprehensive, a drawback is that extracting these features requires a large amount of time. To address these limitations, we propose a multidimensional feature phishing detection approach based on a fast detection method by using deep learning. In the first step, character sequence features of the given URL are extracted and used for quick classification by deep learning, and this step does not require thirdparty assistance or any prior knowledge about phishing. In the second step, we combine URL statistical features, webpage code features, webpage text features, and the quick classification result of deep learning into multidimensional features. The approach can reduce the detection time for setting a threshold. Testing on a dataset containing millions of phishing URLs and legitimate URLs, the accuracy reaches 98.99%, and the false positive rate is only 0.59%. By reasonably adjusting the threshold, the experimental results show that the detection efficiency can be improved. INDEX TERMS Phishing website detection, convolutional neural network, long short-term memory network, semantic feature, machine learning.
Topic models have been widely utilized in Topic Detection and Tracking tasks, which aim to detect, track, and describe topics from a stream of broadcast news reports. However, most existing topic models neglect semantic or syntactic information and lack readable topic descriptions. To exploit semantic and syntactic information, Language Models (LMs) have been applied in many supervised NLP tasks. However, there are still no extensions of LMs for unsupervised topic clustering. Moreover, it is difficult to employ general LMs (e.g., BERT) to produce readable topic summaries due to the mismatch between the pretraining method and the summarization task. In this paper, noticing the similarity between content and summary, first we propose a Language Model-based Topic Model (LMTM) for Topic Clustering by using an LM to generate a deep contextualized word representation. Then, a new method of training a Topic Summarization Model is introduced, where it is not only able to produce brief topic summaries but also used as an LM in LMTM for topic clustering. Empirical evaluations of two different datasets show that the proposed LMTM method achieves better performance over four baselines for JC, FMI, precision, recall and F1-score. Additionally, the generated readable and reasonable summaries also validate the rationality of our model components.
Stance detection on social media aims to identify the stance of social media users toward a topic or claim, which can provide powerful information for various downstream tasks. Many existing stance detection approaches neglect to model the deep semantic representation information in tweets and do not explore aggregating the hierarchical features among words, thus degrading performance. To address these issues, this article proposes a novel deep learning approach P retrained E mbeddings for Stance Detection with H ierarchical C apsule N etwork (PE-HCN) without complicated preprocessing. Specifically, PE-HCN first adopts a pretrained language model and then uses a related textual entailment task for fine-tuning to obtain the deep textual representations of tweets. The PE-HCN approach extends the dynamic routing scheme to cope with these deep textual representations by utilizing primary capsules for routing the information among words in each tweet and applying secondary capsules to transmit the aggregated features to each category capsule accordingly. Moreover, to improve the confidences of the category capsules, we design an adaptive feedback mechanism to dynamically strengthen the routing signals. Through experiments on three benchmark datasets, compared with the state-of-the-art baselines, the extensive results exhibit that PE-HCN achieves competitive improvements of up to 6.32%, 2.09%, and 1.8%, respectively.
Keyphrase generation (KG) aims at condensing the content from the source text to the target concise phrases. Though many KG algorithms have been proposed, most of them are tailored into deep learning settings with various specially designed strategies and may fail in solving the bias exposure problem. Reinforcement Learning (RL), a class of control optimization techniques, are well suited to compensate for some of the limitations of deep learning methods. Nevertheless, RL methods typically suffer from four core difficulties in keyphrase generation: environment interaction and effective exploration, complex action control, reward design, and task-specific obstacle. To tackle this difficult but significant task, we present RegRL-KG, including actor-critic based-reinforcement learning control and L1 policy regularization under the first principle of minimizing the maximum likelihood estimation (MLE) criterion by a sequence-to-sequence (Seq2Seq) deep learnining model, for efficient keyphrase generation. The agent utilizes an actor-critic network to control the generated probability distribution and employs L1 policy regularization to solve the bias exposure problem. Extensive experiments show that our method brings improvement in terms of the evaluation metrics on five scientific article benchmark datasets.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.