Proceedings of the 2009 SIAM International Conference on Data Mining 2009
DOI: 10.1137/1.9781611972795.99
Straightforward Feature Selection for Scalable Latent Semantic Indexing

Abstract: Latent Semantic Indexing (LSI) has been shown to be effective on many small-scale text collections. However, there is little evidence of its effectiveness on unsampled large-scale text corpora, due to its high computational complexity. In this paper, we propose a straightforward feature selection strategy, named Feature Selection for Latent Semantic Indexing (FSLSI), as a preprocessing step so that LSI can be efficiently approximated on large-scale text corpora. We formulate LSI as a continuous optimization problem…
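The preprocessing idea in the abstract can be sketched as follows: shrink the term-document matrix by keeping only the highest-scoring terms, then run ordinary SVD-based LSI on the reduced matrix. This is a minimal illustration, not the paper's method — FSLSI derives its selection criterion from the LSI objective itself, whereas the stand-in criterion below (total term weight) and all function names are hypothetical.

```python
import numpy as np

def select_features(X, k):
    """Keep the k terms with the largest total weight across documents.

    X: term-document matrix (terms x docs). Total-weight ranking is a
    placeholder criterion; the FSLSI paper optimizes the LSI objective
    to pick features instead.
    """
    mass = X.sum(axis=1)               # total weight of each term
    keep = np.argsort(mass)[::-1][:k]  # indices of the top-k terms
    return X[keep, :], keep

def lsi(X, rank):
    """Rank-r LSI via truncated SVD of the (reduced) term-document matrix."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]

# Toy term-document matrix: 5 terms x 4 documents.
X = np.array([[2., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 1., 1.],
              [0., 0., 0., 1.],
              [3., 2., 2., 2.]])

X_sel, kept = select_features(X, k=3)  # drop the 2 lightest terms
U, s, Vt = lsi(X_sel, rank=2)          # 2-dimensional latent space
docs = (np.diag(s) @ Vt).T             # document coordinates in latent space
```

Because the SVD now runs on a k-row matrix instead of the full vocabulary, its cost scales with the reduced dimension, which is what makes the approximation feasible on large corpora.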

Cited by 7 publications (3 citation statements)
References 12 publications
“…(e.g., [2,1,11]). Other approaches for increasing the computational performance of LSA include alternatives to the SVD for dimensionality reduction (e.g., [12]) and feature selection to reduce the size of the term-document matrices (e.g., [14]). …”
Section: Related Work
confidence: 99%
“…They showed that SDD-based LSI does as well as SVD-based LSI in terms of retrieval performance while requiring only very little storage (one-twentieth) and time (one-half) to answer a query. Yan et al. (2009) formulated LSI as a continuous optimization problem and made it effective on large scale collections. Hofmann (1999) presented a novel approach called Probabilistic Latent Semantic Indexing (PLSI).…”
Section: Literature Review
confidence: 99%
“…Kontostathis and Pottenger (2006) presented a mathematical proof that the SVD algorithm can encapsulate term co‐occurrence information. Yan, Yan, Liu, and Chen (2009) formulated LSI as a continuous optimization problem and developed a feature selection algorithm for LSI by optimizing its objective function in terms of discrete optimization. Li and colleagues (Li & Kwong, 2007b; Li, Kwong, & Lee, in press) examined the latent semantic structure of a dataset from a dual perspective; namely, they simultaneously considered the term space and the document space, and then derived a unified kernel function for a class of vector space models from this new viewpoint.…”
Section: Introduction and Previous Research
confidence: 99%