Abstract. Nonlinear manifold learning from unorganized data points is a very challenging unsupervised learning and data visualization problem with a great variety of applications. In this paper we present a new algorithm for manifold learning and nonlinear dimension reduction. Based on a set of unorganized data points sampled with noise from the manifold, we represent the local geometry of the manifold using tangent spaces learned by fitting an affine subspace in a neighborhood of each data point. These tangent spaces are then aligned to give the internal global coordinates of the data points with respect to the underlying manifold by way of a partial eigendecomposition of the neighborhood connection matrix. We present a careful error analysis of our algorithm and show that the reconstruction errors are of second-order accuracy. We illustrate our algorithm using curves and surfaces both in 2D/3D and higher-dimensional Euclidean spaces, and 64-by-64 pixel face images with various pose and lighting conditions. We also address several theoretical and algorithmic issues for further research and improvements.

Keywords: nonlinear dimension reduction, principal manifold, tangent space, subspace alignment, eigenvalue decomposition, perturbation analysis

AMS subject classifications. 15A18, 15A23, 65F15, 65F50

1. Introduction. Many high-dimensional data sets in real-world applications can be modeled as data points lying close to a low-dimensional nonlinear manifold. Discovering the structure of the manifold from a set of data points sampled, possibly with noise, from the manifold is a very challenging unsupervised learning problem [2,3,4,8,9,10,13,14,15,17,18]. The discovered low-dimensional structures can be further used for classification, clustering, outlier detection, and data visualization. Examples of low-dimensional manifolds embedded in high-dimensional input spaces include image vectors representing the same 3D objects under different camera views and lighting conditions, a set of document vectors in a text corpus dealing with a specific topic, and a set of 0-1 vectors encoding the test results on a set of multiple-choice questions for a group of students [13,14,18]. The key observation is that although the dimension of the embedding space can be very high (e.g., the number of pixels in each image of the image collection, the number of terms (words and/or phrases) in the vocabulary of the text corpus, or the number of multiple-choice questions in the test), the intrinsic dimensionality of the data points is rather limited due to factors such as physical constraints and linguistic correlations. Traditional dimension reduction techniques such as principal component analysis and factor analysis usually work well when the data points lie close to a linear (affine) subspace in the input space [7]. They cannot, in general, discover nonlinear structures embedded in the set of data points.

Recently, there has been much renewed interest in developing efficient algorithms for constructing nonlinear low-dimensional manifolds f...
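To make the tangent-space-alignment idea described in the abstract more concrete, here is a minimal NumPy/SciPy sketch of that style of procedure: fit a local affine subspace in each neighborhood via SVD, accumulate the local fits into an alignment matrix, and read the global coordinates off a partial eigendecomposition. The function name `ltsa_sketch`, the neighborhood size `k`, and the toy Swiss-roll data are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.linalg import eigh

def ltsa_sketch(X, d=2, k=10):
    """X: (N, D) sample points; returns (N, d) estimated global coordinates."""
    N = X.shape[0]
    _, nbrs = cKDTree(X).query(X, k=k)        # k nearest neighbors (incl. self)

    B = np.zeros((N, N))                      # alignment (connection) matrix
    for i in range(N):
        idx = nbrs[i]
        Xi = X[idx] - X[idx].mean(axis=0)     # center the neighborhood
        # The d leading left singular vectors span the local tangent coordinates.
        U, _, _ = np.linalg.svd(Xi, full_matrices=False)
        Gi = np.hstack([np.full((k, 1), 1.0 / np.sqrt(k)), U[:, :d]])
        # Accumulate I - Gi Gi^T on the rows/columns of this neighborhood.
        B[np.ix_(idx, idx)] += np.eye(k) - Gi @ Gi.T

    # Global coordinates: eigenvectors for the 2nd..(d+1)th smallest eigenvalues
    # (the smallest eigenvalue corresponds to the constant vector).
    _, vecs = eigh(B)
    return vecs[:, 1:d + 1]

# Toy usage: recover a 2D parameterization of a noisy Swiss-roll-like surface.
rng = np.random.default_rng(0)
t = rng.uniform(0, 4 * np.pi, 800)
h = rng.uniform(0, 5, 800)
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)]) + 0.01 * rng.standard_normal((800, 3))
Y = ltsa_sketch(X, d=2, k=12)
```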
Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, and database integration, and may cause improper attribution to authors. This paper investigates two supervised learning approaches to disambiguate authors in citations. One approach uses the naive Bayes probability model, a generative model; the other uses Support Vector Machines (SVMs) [39] and the vector space representation of citations, a discriminative model. Both approaches utilize three types of citation attributes: co-author names, the title of the paper, and the title of the journal or proceedings. We illustrate these two approaches on two types of data: one collected from the web, mainly publication lists from home pages, and the other collected from the DBLP citation databases.
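As a rough illustration of the two classifier families named above, the following scikit-learn sketch builds a vector-space representation from the three citation attributes and fits both a naive Bayes model and a linear SVM. The sample citations, labels, and tokenization are made-up placeholders, not the paper's data or feature design.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Each "citation" concatenates the three attribute types into one string.
citations = [
    "coauthors: J Smith A Kumar | title: learning to rank | venue: SIGIR",
    "coauthors: J Smith B Lee | title: database indexing | venue: VLDB",
    "coauthors: A Kumar C Wong | title: ranking web pages | venue: WWW",
    "coauthors: B Lee D Park | title: query optimization | venue: VLDB",
]
authors = ["smith_1", "smith_2", "smith_1", "smith_2"]   # disambiguated labels

vec = TfidfVectorizer(token_pattern=r"[A-Za-z]+")
X = vec.fit_transform(citations)

nb = MultinomialNB().fit(X, authors)       # generative model
svm = LinearSVC().fit(X, authors)          # discriminative model

query = ["coauthors: A Kumar | title: learning to rank documents | venue: SIGIR"]
print(nb.predict(vec.transform(query)), svm.predict(vec.transform(query)))
```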
Large-scale datasets possessing clean label annotations are crucial for training Convolutional Neural Networks (CNNs). However, labeling large-scale data can be very costly and error-prone, and even high-quality datasets are likely to contain noisy (incorrect) labels. Existing works usually employ a closed-set assumption, whereby the samples associated with noisy labels possess a true class contained within the set of known classes in the training data. However, such an assumption is too restrictive for many applications, since samples associated with noisy labels might in fact possess a true class that is not present in the training data. We refer to this more complex scenario as the open-set noisy label problem and show that making accurate predictions in this setting is nontrivial. To address this problem, we propose a novel iterative learning framework for training CNNs on datasets with open-set noisy labels. Our approach detects noisy labels and learns deep discriminative features in an iterative fashion. To benefit from the noisy label detection, we design a Siamese network to encourage clean labels and noisy labels to be dissimilar. A reweighting module is also applied to simultaneously emphasize the learning from clean labels and reduce the effect caused by noisy labels. Experiments on CIFAR-10, ImageNet and real-world noisy (web-search) datasets demonstrate that our proposed model can robustly train CNNs in the presence of a high proportion of open-set as well as closed-set noisy labels.
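The two loss components named in the abstract (a Siamese constraint that separates clean from noisy samples, and a reweighting of the classification loss) can be sketched as below. This is a minimal PyTorch sketch under assumed interfaces, not the paper's implementation: the noise scores, margin, and pairing scheme are illustrative placeholders that stand in for the framework's noisy-label detector.

```python
import torch
import torch.nn.functional as F

def siamese_contrastive_loss(f1, f2, same_cleanliness, margin=1.0):
    """f1, f2: (B, d) feature pairs; same_cleanliness: 1 if both samples are
    clean or both noisy, 0 if one is clean and one is noisy (pushed apart)."""
    dist = F.pairwise_distance(f1, f2)
    pull = same_cleanliness * dist.pow(2)                          # similar pairs
    push = (1 - same_cleanliness) * F.relu(margin - dist).pow(2)   # dissimilar pairs
    return (pull + push).mean()

def reweighted_ce(logits, labels, noise_score):
    """Cross-entropy weighted by (1 - noise_score), so samples suspected of
    carrying noisy labels contribute less to the classification loss."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return ((1.0 - noise_score) * per_sample).mean()

# Toy usage with random tensors standing in for network outputs.
B, d, C = 8, 128, 10
f1, f2 = torch.randn(B, d), torch.randn(B, d)
same = torch.randint(0, 2, (B,)).float()
logits, labels = torch.randn(B, C), torch.randint(0, C, (B,))
noise_score = torch.rand(B)            # e.g. produced by a noisy-label detector
loss = siamese_contrastive_loss(f1, f2, same) + reweighted_ce(logits, labels, noise_score)
```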