Keyphrase extraction is an important part of natural language processing (NLP) research, yet little of that research addresses web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, but it remains largely untouched in scientific research. Current research is often applied only to clean corpora, such as abstracts and articles from academic journals or sets of scraped texts from a single domain. Text from web pages differs from ordinary text documents, however: it is structured using HTML elements and often consists of many small fragments. These elements are furthermore used in a highly inconsistent manner and are likely to contain noise. We evaluated the keyphrases extracted by several state-of-the-art extraction methods and found that they did not transfer well to web pages. We therefore propose WebEmbedRank, an adaptation of a recently proposed extraction method that can make use of structural information in web pages in a robust manner. We compared this novel method to baselines and state-of-the-art methods using a manually annotated dataset and found that WebEmbedRank achieved significant improvements over existing extraction methods on web pages.
Following the rise of e-commerce, there has been a dramatic increase in online criminal activities targeting online shoppers. Because the number of online stores has risen dramatically, manually checking these stores has become intractable. An automated process is therefore required. We approached this problem by applying machine learning techniques to extract and detect instances of fraudulent online stores. Two sources of information were used to determine the legitimacy of an online store. First, contextual features extracted from the HTML and meta information were used to train various machine learning algorithms. Second, visual information, such as the presence of social media logos, was added to improve on this baseline model. Adding visual information had a positive effect, raising the F1-score from 0.93 for the baseline model to 0.98. Finally, this research shows that visual information can improve recall during web crawling.
CCS CONCEPTS • Information systems → Web mining; • Computing methodologies → Machine learning.
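As a rough illustration of the first information source in the abstract above (contextual features extracted from the HTML and meta information), the following minimal sketch counts a few page properties with Python's standard-library `html.parser`. The abstract does not list the features that were actually used, so the feature names, the `SOCIAL_DOMAINS` list, and the `extract_features` helper are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: illustrative domains only; the source does not name any.
SOCIAL_DOMAINS = ("facebook.com", "twitter.com", "instagram.com")


class StoreFeatureParser(HTMLParser):
    """Collects a few simple counts from a page's HTML (hypothetical features)."""

    def __init__(self):
        super().__init__()
        self.features = {
            "n_links": 0,
            "n_social_links": 0,
            "n_meta_tags": 0,
            "has_meta_description": 0,
        }

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = dict(attrs)
        if tag == "a":
            href = (attr.get("href") or "").lower()
            self.features["n_links"] += 1
            if any(domain in href for domain in SOCIAL_DOMAINS):
                self.features["n_social_links"] += 1
        elif tag == "meta":
            self.features["n_meta_tags"] += 1
            if (attr.get("name") or "").lower() == "description":
                self.features["has_meta_description"] = 1


def extract_features(html):
    """Return a dict of numeric features for one page."""
    parser = StoreFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.features


page = (
    '<html><head><meta name="description" content="Shoes">'
    '<meta charset="utf-8"></head><body>'
    '<a href="https://www.facebook.com/shop">Facebook</a>'
    '<a href="/cart">Cart</a></body></html>'
)
features = extract_features(page)
```

A feature dictionary of this kind could then be turned into a numeric vector and passed to any standard classifier. Visual signals such as detected logos would be appended as further columns.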