Classification of imbalanced data has become a widespread problem because most real-world datasets are imbalanced. One challenge in a classification task is learning the feature space under a class-imbalance setting. The majority classes are generally well represented in the learned classification function, whereas the minority classes lack this representation; consequently, classification fails more often for the minority classes. In this paper, the authors investigate document classification with a topic-map-based representation of documents under a class-imbalance setting. To measure the effectiveness of the topic-map-based representation for classification on imbalanced data, the authors compare three representations (Bag-of-Words, Phrases, and Topic terms) under three approaches: (i) under-sampling, (ii) cost-adjusting, and (iii) cluster-based sampling. A series of experiments is carried out and the results are reported.
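As an illustration of two of the approaches named in this abstract, the sketch below shows random under-sampling and a simple cost-adjusting scheme based on inverse class frequency. The abstract does not specify the exact procedures used, so the function names, the balancing rule (shrink every class to the size of the smallest one), and the weighting formula are assumptions made for this example only.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly shrink every class to the size of the smallest class.

    Assumption: the paper's under-sampling may differ; this is the
    plain random variant.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    n_min = min(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_min):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def inverse_frequency_weights(labels):
    """Per-class misclassification costs: n / (k * count).

    Assumption: a common heuristic, not necessarily the cost scheme
    used in the paper. Rare classes receive larger weights.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(10)]
    labels = ["majority"] * 8 + ["minority"] * 2
    _, balanced_labels = random_undersample(docs, labels)
    print(Counter(balanced_labels))
    print(inverse_frequency_weights(labels))
```

Under-sampling discards majority-class documents, while cost-adjusting keeps all documents and instead penalizes minority-class errors more heavily; the weights returned here could be passed to any classifier that accepts per-class costs.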
Document clustering is usually performed as an unsupervised task. It attempts to separate a document collection into groups of documents (clusters) by implicitly identifying the common patterns present in these documents. Semi-supervised approaches to this problem have recently reported promising results. In a semi-supervised approach, explicit background knowledge (for example, Must-link or Cannot-link information for a pair of documents) is used in the form of constraints to drive the clustering process in the right direction. In this paper, a semi-supervised approach to document clustering is proposed. The paper makes three main contributions: (i) a document is first transformed into a graph representation based on the Graph-of-Word approach, and word sequences of size 3 are extracted from this graph and used as features for the semi-supervised clustering; (ii) a similarity function based on common word sequences is proposed; and (iii) a constraint-based algorithm is designed to perform the actual clustering process through active learning. The proposed algorithm is implemented and extensively tested on three standard text mining datasets. The method clearly outperforms recently proposed algorithms for document clustering in terms of standard evaluation measures for the document clustering task.
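The first two contributions in this abstract can be sketched roughly as follows. The abstract does not give the graph construction or the similarity formula, so everything here is an assumption for illustration: the graph is directed with an edge from each word to the word that follows it (window of 2), a "word sequence of size 3" is taken to be a directed path through three nodes, and similarity is the Jaccard overlap of the two sequence sets.

```python
from collections import defaultdict


def graph_of_words(tokens, window=2):
    """Directed graph: edge u -> v if v follows u within `window` positions.

    Assumption: window=2 links only adjacent words.
    """
    graph = defaultdict(set)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            graph[u].add(v)
    return graph


def sequences_of_three(graph):
    """Enumerate all directed paths u -> v -> w as 3-word tuples."""
    sequences = set()
    for u, successors in graph.items():
        for v in successors:
            # .get avoids inserting new keys into the defaultdict
            for w in graph.get(v, ()):
                sequences.add((u, v, w))
    return sequences


def sequence_similarity(doc_a, doc_b):
    """Jaccard overlap of the two documents' 3-word sequence sets
    (a hypothetical stand-in for the paper's similarity function)."""
    a = sequences_of_three(graph_of_words(doc_a.lower().split()))
    b = sequences_of_three(graph_of_words(doc_b.lower().split()))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(sequence_similarity("a b c d", "a b c"))
```

Because sequences are read off the graph rather than the raw text, a repeated word can join contexts that never appear contiguously in the document, which is one way such features differ from plain trigrams.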
NoSQL databases, a newer category of databases, are horizontally scalable, which makes them well suited to data centers that require very large databases holding a variety of data types. The performance of SQL Server 2012 and Cassandra was compared in a limited scenario, and it was quite clear that relational databases remain the preferred choice for the kind of database that businesses require. NoSQL technology is improving at a fast pace, and different types of databases are coming onto the market. New schema-free environments and flexible table designs offer a lot to look forward to. The four different types of NoSQL databases each provide specialized capabilities for specific technology areas.
Graph partitioning has been a popular research topic since the 1970s. It attracts a diverse group of researchers from fields such as engineering, science, and mathematics. In the last decade, graphs have grown to billions of vertices. Although storage devices have become cheaper, a single machine cannot process graphs of this size. This creates the need to partition the graph so that a group of machines can perform computations on it in parallel, which saves time and produces results quickly. The research problem is that, with existing partitioning techniques, the ratio of boundary vertices to interior vertices increases as the number of partitions increases. To address this issue, the random edge selection method of the GraphLab algorithm was replaced with four suggested edge sorting techniques. The results were compared with GraphLab's random edge selection method using various performance parameters.
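The boundary-to-interior ratio mentioned in this abstract can be made concrete with a small sketch. The abstract does not define the terms precisely, so this example assumes an edge-based partitioning in which each edge is assigned to one partition, a boundary vertex is one whose incident edges fall in more than one partition, and an interior vertex is one whose edges all fall in a single partition. The function name and input format are hypothetical.

```python
from collections import defaultdict


def boundary_interior_ratio(edge_partition):
    """Ratio of boundary vertices to interior vertices.

    edge_partition: dict mapping an edge (u, v) to its partition id.
    Assumption: a vertex is 'boundary' if its edges span more than
    one partition, otherwise 'interior'.
    """
    partitions_of = defaultdict(set)
    for (u, v), part in edge_partition.items():
        partitions_of[u].add(part)
        partitions_of[v].add(part)

    boundary = sum(1 for parts in partitions_of.values() if len(parts) > 1)
    interior = len(partitions_of) - boundary
    return boundary / interior if interior else float("inf")


if __name__ == "__main__":
    # A path 1-2-3-4-5 split into two partitions at vertex 3.
    assignment = {(1, 2): 0, (2, 3): 0, (3, 4): 1, (4, 5): 1}
    print(boundary_interior_ratio(assignment))
```

A lower ratio means fewer vertices are shared between machines, so less cross-machine communication is needed during parallel computation.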
There is an abundance of duplicated web documents on the internet. For example, two online documents may be nearly identical except for a very small portion, such as URLs and advertisements. While such differences are not important for web search, the duplication itself degrades web search results. Therefore, if web crawlers could check the duplication percentage of newly crawled pages against previously crawled pages, the quality of web search would increase significantly. The main objective of this research is to propose a method that checks the duplication ratio between the content of a page and content that has already been crawled. The solution runs a web crawling algorithm that calculates the duplication ratio at crawl time. To achieve these goals, Charikar's SimHash fingerprinting technique is used. On this basis, a new technique for detecting exact and near duplication is devised, which checks the duplication ratio of each newly crawled page. The experiment is carried out on multiple pages of two major B2B websites, namely Alibaba and TradeKey. More than 300 pages from two similar categories on each portal were selected for this experiment. The selected pages were first evaluated with a third-party duplication detection tool to set the benchmark. The results obtained from the test are very promising and close to the benchmark, and the system running time was very short. However, the results deviate from the benchmark by about 10% on average, which is acceptable in this case. Based on the results of the experiment, Charikar's SimHash fingerprinting technique can be used effectively to detect duplication and near duplication.
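The core idea of SimHash fingerprinting can be sketched in a few lines. This is a minimal, unweighted version for illustration and not the system described in the abstract: the tokenizer, the use of truncated MD5 as the per-token hash, the 64-bit fingerprint size, and the conversion of Hamming distance into a similarity score are all assumptions made for this example.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a SimHash-style fingerprint over word tokens.

    Assumptions: unweighted tokens, MD5 truncated to 64 bits as the
    token hash (so bits must not exceed 64).
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1

    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """Fraction of matching bits, used here as a rough duplication ratio."""
    return 1.0 - hamming_distance(a, b) / bits


if __name__ == "__main__":
    page_a = "Wholesale LED lamps from verified suppliers, minimum order 100 units"
    page_b = "Wholesale LED lamps from verified suppliers, minimum order 200 units"
    print(similarity(simhash(page_a), simhash(page_b)))
```

Because similar token sets produce fingerprints that differ in only a few bits, a crawler can compare a new page's fingerprint with stored fingerprints and treat a small Hamming distance as a near duplicate; the threshold to use is a tuning decision that this sketch leaves open.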