A fuzzy-matching clustering algorithm is applied to cluster similar client names in the Lobbying Disclosure Database. Because of errors and inconsistencies in manual data entry, a client's name often appears in multiple forms, including misspellings and shorthand, which makes it difficult to associate lobbying activities and interests with a single client. The various forms of the same client's name therefore need to be consolidated into one group, or cluster. For efficient clustering, we apply a series of preprocessing techniques before calculating the string distance between two client names. An optimized threshold selection is adopted, which helps improve clustering accuracy. Single-linkage hierarchical clustering is used to cluster the client names. The algorithm proves effective in clustering similar client names, and it also helps identify a representative name for each client cluster.
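The pipeline described above (preprocess, compute string distance, cluster with single linkage under a threshold, pick a representative) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the suffix list, the normalized Levenshtein distance, the threshold of 0.25, and the "most frequent spelling" rule for the representative name are all assumptions, since the abstract does not specify them. The function names are hypothetical as well.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Assumed suffix list; the abstract does not say which preprocessing steps are used.
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company", "llc", "ltd"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop common corporate suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (an assumed measure)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cluster_names(names, threshold=0.25):
    """Single-linkage clustering cut at `threshold` (placeholder value).

    Cutting a single-linkage dendrogram at height t yields the connected
    components of the graph that links every pair with distance <= t, so a
    union-find over all pairs is sufficient here.
    """
    keys = [normalize(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if distance(keys[i], keys[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, name in enumerate(names):
        groups[find(i)].append(name)
    return list(groups.values())


def representative(cluster):
    """Assumed rule: the most frequent raw spelling represents the cluster."""
    return Counter(cluster).most_common(1)[0][0]
```

The all-pairs comparison is quadratic in the number of names, which is acceptable for a sketch but would likely need blocking or indexing on a full database.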
Classifying phishing websites can be expensive, both computationally and financially, given a large enough volume of suspect sites. A distributed cloud environment can reduce the computation time and financial cost significantly. To test this idea, we apply a multi-modal feature classification algorithm to classify phishing websites in one non-distributed and several distributed environments. The multi-modal approach combines visual and text features for classification. The implementation extracts color and histogram features from a screenshot of a phishing website and text from its HTML source code. Feature extraction and comparison are accomplished using the MapReduce framework. Implementing the multi-modal approach in a distributed environment is shown to reduce both the runtime and the financial cost. We present results showing that our system is 30 times faster than existing state-of-the-art systems for phishing website classification.
Keywords: Phishing, MapReduce, Color code
Contributions: The contributions of this paper are as follows:
1. We develop a high-performance multi-modal phishing website classification system using MapReduce.
2. We conduct performance evaluations to demonstrate the significant performance and cost advantage of our system over the existing state of the art.
3. We conduct extensive experiments on a real cloud using Amazon EMR and Amazon S3.
Organization: The rest of the paper explores the details of the multi-modal phish classification algorithm and its performance in a non-distributed versus a distributed environment. In Section II, we discuss the related research. Section III presents our approach and algorithms. We describe our distance measures for the classification task in Section IV, and the classification technique in Section V. In Section VI, we discuss our experimental setup and results, and provide an analysis in