Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. AbstractIn recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on these huge resources. Similarity based categorization algorithms such as k-nearest neighbor, generalized instance set and centroid based classification have been shown to be very effective in document categorization. A major drawback of these algorithms is that they use all features when computing the similarities. In many document data sets, only a small number of the total vocabulary may be useful for categorizing documents. A possible approach to overcome this problem is to learn weights for different features (or words in document data sets). In this report we present two fast iterative feature weight adjustment algorithms for the linearcomplexity centroid based classification algorithm. Our algorithms use a measure of the discriminating power of each term to gradually adjust the weights of all features concurrently. We experimentally evaluate our algorithms on the Reuters-21578 and OHSUMED document collections and compare it against Rocchio, Widrow-Hoff and SVM. We also compared its performance in terms of classification accuracy on data sets with multiple classes. On these data sets we compared its performance against traditional classifiers such as k-nn, Naive Bayesian and C4.5. Experimentsshow that feature weight adjustment improves the performance of the centroid-based classifier by 2-5% , substantially outperforms Rocchio and Widrow-Hoff and is competitive with SVM. These algorithms also outperform traditional classifiers such as k-nn, naive bayesian and C4.5 on the multi-class text document data sets.
Partitioning is typically employed on large-scale data to improve manageability, availability, and performance. However, for tables connected by a referential constraint (capturing a parent-child relationship), the current approaches require individually partitioning each table thereby burdening the user with the task of maintaining the tables equi-partitioned, which not only is cumbersome but also error prone. This paper proposes a new partitioning method (partition by reference) that allows tables with a parent-child relationship to be logically equi-partitioned by inheriting the partition key from the parent table without duplicating the key columns. The partitioning key is resolved through an existing parent-child relationship, enforced by an active referential constraint. This logical dependency is used to automatically i) cascade partition maintenance operations performed on parent table to child tables, and ii) handle migration of child rows when partition key or parent key in parent table is updated, as a single atomic operation. This method has been introduced in Oracle Database 11gR1 with support for tables with both single level and composite partitioning methods. The paper describes the key concepts of table partitioning by reference method, discusses the design and implementation challenges, and presents an experimental study covering a usage scenario common in Information Life Cycle Management (ILM) applications.
The detection of phishing and legitimate websites is considered a great challenge for web service providers because the users of such websites are indistinguishable. Phishing websites also create traffic in the entire network. Another phishing issue is the broadening malware of the entire network, thus highlighting the demand for their detection while massive datasets (i.e., big data) are processed. Despite the application of boosting mechanisms in phishing detection, these methods are prone to significant errors in their output, specifically due to the combination of all website features in the training state. The upcoming big data system requires MapReduce, a popular parallel programming, to process massive datasets. To address these issues, a probabilistic latent semantic and greedy levy gradient boosting (PLS-GLGB) algorithm for website phishing detection using MapReduce is proposed. A feature selection-based model is provided using a probabilistic intersective latent semantic preprocessing model to minimize errors in website phishing detection. Here, the missing data in each URL are identified and discarded for further processing to ensure data quality. Subsequently, with the preprocessed features (URLs), feature vectors are updated by the greedy levy divergence gradient (model) that selects the optimal features in the URL and accurately detects the websites. Thus, greedy levy efficiently differentiates between phishing websites and legitimate websites. Experiments are conducted using one of the largest public corpora of a website phish tank dataset. Results show that the PLS-GLGB algorithm for website phishing detection outperforms stateof-the-art phishing detection methods. Significant amounts of phishing detection time and errors are also saved during the detection of website phishing.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.