The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is to divide the clustering process into an online component, which periodically stores detailed summary statistics, and an offline component, which uses only these summary statistics. The offline component is utilized by the analyst, who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a microclustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003
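The online/offline split described in the abstract above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, since the abstract does not give these details: the `Summary` layout (count, linear sum, squared sum), the names `SnapshotStore`, `run_stream`, and `horizon_summary`, the rule that a snapshot taken at time t gets order i when `alpha**i` is the largest power of `alpha` dividing t, and the limit of `per_order` snapshots kept per order. A single summary stands in for what would be many micro-clusters in practice.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Summary:
    """Additive statistics of a group of points (assumed layout)."""
    n: int
    linear_sum: Tuple[float, ...]
    square_sum: Tuple[float, ...]

    def add(self, point: Tuple[float, ...]) -> "Summary":
        """Online step: absorb one point without keeping the point itself."""
        return Summary(
            self.n + 1,
            tuple(s + x for s, x in zip(self.linear_sum, point)),
            tuple(s + x * x for s, x in zip(self.square_sum, point)),
        )

    def minus(self, older: "Summary") -> "Summary":
        """Statistics of the points that arrived after `older` was taken."""
        return Summary(
            self.n - older.n,
            tuple(a - b for a, b in zip(self.linear_sum, older.linear_sum)),
            tuple(a - b for a, b in zip(self.square_sum, older.square_sum)),
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(s / self.n for s in self.linear_sum)


def empty(dim: int) -> Summary:
    return Summary(0, (0.0,) * dim, (0.0,) * dim)


class SnapshotStore:
    """Simplified pyramidal store: recent times are kept at fine granularity,
    older times only at coarse granularity (hypothetical retention rule)."""

    def __init__(self, alpha: int = 2, per_order: int = 3) -> None:
        self.alpha = alpha
        self.per_order = per_order
        self.by_order: Dict[int, List[Tuple[int, Summary]]] = {}

    def order_of(self, t: int) -> int:
        """Largest i such that alpha**i divides t (requires t >= 1)."""
        i = 0
        while t % (self.alpha ** (i + 1)) == 0:
            i += 1
        return i

    def store(self, t: int, snapshot: Summary) -> None:
        kept = self.by_order.setdefault(self.order_of(t), [])
        kept.append((t, snapshot))
        if len(kept) > self.per_order:
            kept.pop(0)  # drop the oldest snapshot of this order

    def times(self) -> List[int]:
        return sorted(t for kept in self.by_order.values() for t, _ in kept)

    def latest_at_or_before(self, t: int) -> Optional[Tuple[int, Summary]]:
        found = [(ts, s) for kept in self.by_order.values()
                 for ts, s in kept if ts <= t]
        return max(found, key=lambda pair: pair[0]) if found else None


def run_stream(n_ticks: int, alpha: int = 2, per_order: int = 3):
    """Online component on a toy stream: the point at time t is (t,)."""
    store = SnapshotStore(alpha, per_order)
    current = empty(1)
    for t in range(1, n_ticks + 1):
        current = current.add((float(t),))
        store.store(t, current)
    return store, current


def horizon_summary(store: SnapshotStore, current: Summary,
                    now: int, horizon: int):
    """Offline component: summary of roughly the last `horizon` ticks."""
    found = store.latest_at_or_before(now - horizon)
    if found is None:
        return None
    base_time, base = found
    return base_time, current.minus(base)


if __name__ == "__main__":
    store, current = run_stream(16)
    print(store.times())
    print(horizon_summary(store, current, now=16, horizon=8))
```

Because the statistics are additive, the offline step needs only two stored snapshots to describe a time window, never the raw points. The retention rule trades precision for space: an old horizon may be matched only approximately by the nearest retained snapshot.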
In many organic reactions, the O₂ activation process involves a key step where inert ground triplet O₂ is excited to produce highly reactive singlet O₂. It remains elusive what factor induces the change in the electron spin state of O₂ molecules, although it has been discovered that the presence of noble metal nanoparticles can promote the generation of singlet O₂. In this work, we first demonstrate that surface facet is a key parameter to modulate the O₂ activation process on metal nanocrystals, by employing single-facet Pd nanocrystals as a model system. The experimental measurements clearly show that singlet O₂ is preferentially formed on {100} facets. The simulations further elucidate that the chemisorption of O₂ to the {100} facets can induce a spin-flip process in the O₂ molecules, which is achieved via electron transfer from Pd surface to O₂. With the capability of tuning O₂ activation, we have been able to further implement the {100}-faceted nanocubes in glucose oxidation. It is anticipated that this study will open a door to designing noble metal nanocatalysts for O₂ activation and organic oxidation. Another perspective of this work would be the controllability in tailoring the cancer treatment materials for high ¹O₂ production efficiency, based on the facet control of metal nanocrystals. In the cases of both organic oxidation and cancer treatment, it has been exclusively proven that the efficiency of producing singlet O₂ holds the key to the performance of Pd nanocrystals in the applications.
Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem because short texts are sparse, high-dimensional, and arrive in large volumes. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbreviated GSDMM). We find that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and that it converges quickly. GSDMM can also cope with the sparse and high-dimensional problem of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM can achieve significantly better performance than three other clustering models.
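As a rough illustration of what a collapsed Gibbs sampler for a Dirichlet Multinomial Mixture can look like, the Python sketch below reassigns each document to a cluster with probability proportional to the standard collapsed conditional for this model. It is not taken from the abstract: the function name `gsdmm`, the parameter names `K`, `alpha`, `beta`, and `iters`, their default values, and the random initialization are all assumptions. `K` acts only as an upper bound, and the number of clusters that remain non-empty after sampling is read here as the inferred cluster count.

```python
import math
import random
from collections import Counter


def gsdmm(docs, K=8, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Return one cluster label per document (a list of word lists)."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})   # vocabulary size
    m = [0] * K                                 # documents per cluster
    n = [0] * K                                 # word tokens per cluster
    nw = [Counter() for _ in range(K)]          # word counts per cluster
    z = []                                      # current label of each document

    def move(doc, k, sign):
        """Add (sign=+1) or remove (sign=-1) a document from cluster k."""
        m[k] += sign
        n[k] += sign * len(doc)
        for w in doc:
            nw[k][w] += sign

    # Assumed initialization: uniform random labels.
    for doc in docs:
        k = rng.randrange(K)
        z.append(k)
        move(doc, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            move(doc, z[d], -1)                 # exclude the document itself
            log_p = []
            for k in range(K):
                # Popular clusters are favoured through the (m_k + alpha) term.
                lp = math.log(m[k] + alpha)
                seen = Counter()
                for i, w in enumerate(doc):
                    # Clusters that already contain the document's words
                    # are favoured through the word-count ratio.
                    lp += math.log(nw[k][w] + beta + seen[w])
                    lp -= math.log(n[k] + V * beta + i)
                    seen[w] += 1
                log_p.append(lp)
            top = max(log_p)                    # stabilise the exponentials
            weights = [math.exp(lp - top) for lp in log_p]
            z[d] = rng.choices(range(K), weights=weights)[0]
            move(doc, z[d], +1)
    return z


if __name__ == "__main__":
    toy_docs = [
        ["stream", "cluster", "window"],
        ["cluster", "stream", "snapshot"],
        ["oxygen", "palladium", "facet"],
        ["facet", "oxygen", "singlet"],
    ]
    labels = gsdmm(toy_docs, K=4, seed=1)
    print(labels, "non-empty clusters:", len(set(labels)))
```

Working in log space avoids underflow when a document has many words, and the terms that are constant across clusters are dropped because only the relative weights matter for sampling.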