Due to the rapid increase in the amount of data generated in many fields of science and engineering, information retrieval methods tailored to large-scale datasets have become increasingly important in recent years. Semantic hashing is an emerging technique for this purpose, built on the idea of representing complex data objects, such as images and text, with similarity-preserving binary codes that are then used for indexing and search. In this paper, we investigate a hashing algorithm that uses a deep variational auto-encoder to learn and predict the codes. Unlike previous approaches of this type, which learn a continuous (Gaussian) representation and then project the embedding to obtain hash codes, our method employs Bernoulli latent variables in both the training and the prediction stages. Constraining the model to a binary encoding allows us to obtain a more interpretable representation for hashing: each factor in the generative model represents a bit that should help to reconstruct, and thus identify, the input pattern. Interestingly, we found that the binary constraint leads not to a loss but to an increase in search accuracy. We argue that continuous formulations learn a representation that can differ significantly from the code used for search; minding this gap in the design of the auto-encoder can translate into more accurate retrieval results. Extensive experiments on seven image and text datasets illustrate these findings and demonstrate the advantages of our approach.
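The retrieval mechanism this abstract describes (binary codes obtained by thresholding Bernoulli probabilities, compared by Hamming distance at search time) can be sketched in a few lines. The following is a toy numpy illustration with an untrained random sigmoid encoder, not the paper's deep variational auto-encoder; all names, dimensions, and data are our own assumptions for the sketch.

```python
import numpy as np

# Toy sketch of semantic hashing with binary codes: a random sigmoid
# "encoder" maps inputs to bit probabilities, codes are obtained by
# thresholding at 0.5, and search ranks items by Hamming distance.
rng = np.random.default_rng(0)

def encode(x, W, b):
    """Map input features to Bernoulli bit probabilities (sigmoid layer)."""
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

def to_code(probs):
    """Deterministic binarization used at prediction time: p > 0.5 -> bit 1."""
    return (probs > 0.5).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two codes."""
    return int(np.sum(a != b))

n_features, n_bits = 8, 32            # dimensions chosen for the toy example
W = rng.normal(size=(n_features, n_bits))
b = np.zeros(n_bits)

database = rng.normal(size=(100, n_features))
codes = to_code(encode(database, W, b))        # one binary code per item

# A query that is a near-duplicate of database item 3 should map to a code
# close in Hamming distance to that item's code.
query = database[3] + 0.001 * rng.normal(size=n_features)
qcode = to_code(encode(query[None, :], W, b))[0]

dists = [hamming(qcode, c) for c in codes]
best = int(np.argmin(dists))
```

In the paper's setting the encoder weights would be learned by the variational auto-encoder; the point of the sketch is only that the object stored and searched is the binary code itself, so there is no gap between the trained representation and the one used for retrieval.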
The field of astronomical data analysis has experienced an important paradigm shift in recent years. The automation of certain analysis procedures is no longer merely a desirable feature for reducing human effort, but a must-have asset for coping with the extremely large datasets that new instrumentation technologies are producing. In particular, the detection of transiting planets --- bodies that move across the face of another body --- is an ideal setting for intelligent automation. Determining whether the variation within a light curve is evidence of a planet requires applying advanced pattern recognition methods to a very large number of candidate stars. Here we present a supervised learning approach that refines the results produced by a case-by-case analysis of light curves, harnessing the generalization power of machine learning techniques to predict labels for the currently unclassified light curves. The method uses feature engineering to find a suitable representation for classification, and different performance criteria to evaluate candidate representations and select among them. Our results show that this automatic technique can help to speed up the very time-consuming manual process currently carried out by expert scientists.
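As a minimal illustration of the pipeline this abstract outlines (feature engineering on light curves followed by a supervised classifier), the numpy sketch below generates synthetic light curves, extracts two simple features (a transit-depth proxy and the fraction of low-flux points), and classifies them with a nearest-centroid rule. The data generator, features, and classifier are our simplified stand-ins, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(1)
NOISE = 0.002  # photometric noise level assumed for the synthetic curves

def make_curve(n=200, transit=False):
    """Synthetic normalized light curve; transiting curves get periodic dips."""
    flux = 1.0 + NOISE * rng.normal(size=n)
    if transit:
        for start in range(20, n, 60):
            flux[start:start + 8] -= 0.01  # box-shaped transit dip
    return flux

def features(flux):
    """Two hand-engineered features: dip depth and fraction of low points."""
    med = np.median(flux)
    depth = med - flux.min()
    frac_low = np.mean(flux < med - 3 * NOISE)
    return np.array([depth, frac_low])

# Labeled set: 50 transiting and 50 non-transiting synthetic curves.
X, y = [], []
for _ in range(50):
    X.append(features(make_curve(transit=True)));  y.append(1)
    X.append(features(make_curve(transit=False))); y.append(0)
X, y = np.array(X), np.array(y)

# Nearest-centroid classifier in feature space.
c1 = X[y == 1].mean(axis=0)   # centroid of transit class
c0 = X[y == 0].mean(axis=0)   # centroid of non-transit class

def predict(f):
    return int(np.linalg.norm(f - c1) < np.linalg.norm(f - c0))

# Held-out evaluation on fresh synthetic curves.
truth = [i % 2 for i in range(20)]
preds = [predict(features(make_curve(transit=bool(t)))) for t in truth]
acc = float(np.mean([p == t for p, t in zip(preds, truth)]))
```

A real pipeline would replace the synthetic generator with observed light curves and compare several feature sets and classifiers under the performance criteria the abstract mentions; the structure (features, fit, evaluate) stays the same.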