In this paper we discuss the development and use of low-rank approximate nonnegative matrix factorization (NMF) algorithms for feature extraction and identification in the fields of text mining and spectral data analysis. The evolution and convergence properties of hybrid methods based on both sparsity and smoothness constraints for the resulting nonnegative matrix factors are discussed. The interpretability of NMF outputs in specific contexts are provided along with opportunities for future work in the modification of NMF algorithms for large-scale and time-varying datasets.
In this study, we apply a non-negative matrix factorization approach for the extraction and detection of concepts or topics from electronic mail messages. For the publicly released Enron electronic mail collection, we encode sparse term-by-message matrices and use a low rank non-negative matrix factorization algorithm to preserve natural data non-negativity and avoid subtractive basis vector and encoding interactions present in techniques such as principal component analysis. Results in topic detection and message clustering are discussed in the context of published Enron business practices and activities, and benchmarks addressing the computational complexity of our approach are provided. The resulting basis vectors and matrix projections of this approach can be used to identify and monitor underlying semantic features (topics) and message clusters in a general or high-level way without the need to read individual electronic mail messages.
The overhead of assessing technology readiness for deployment and investment purposes can be costly to both large and small businesses. Recent advances in the automatic interpretation of technology readiness levels (TRLs) of a given technology can substantially reduce the risk and associated cost of bringing these new technologies to market. Using vector-space information-retrieval models, such as latent semantic indexing, it is feasible to group similar technology descriptions by exploiting the latent structure of term usage within textual documents. Once the documents have been semantically clustered (or grouped), they can be classified based on the TRL scores of (known) nearest-neighbor documents. Three automated (no human curation) strategies for assigning TRLs to documents are discussed with accuracies as high as 86% achieved for two-class problems.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.