2020
DOI: 10.1109/tpds.2020.2979702

SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs

Abstract: Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time …
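The abstract's dense-versus-sparse distinction can be illustrated with a toy sketch (the sizes and the counting scheme below are illustrative assumptions, not the paper's implementation): a dense sampler stores one counter per topic and scans all K of them per token, while a sparsity-aware sampler only touches the topics that actually occur in a document.

```python
from collections import Counter
import random

random.seed(0)
K = 1000        # number of topics (illustrative)
doc_len = 50    # tokens in one document (illustrative)

# Dense document-topic counts: one slot per topic, mostly zeros.
topics = [random.randrange(K) for _ in range(doc_len)]
dense_counts = [0] * K
for z in topics:
    dense_counts[z] += 1

# Sparse view: only the topics that actually occur in the document.
sparse_counts = Counter(topics)

# A dense kernel scans all K entries per token; a sparse kernel touches
# only the topics present in the document (at most doc_len of them).
print(len(dense_counts), len(sparse_counts))
```

When K greatly exceeds the document length, the sparse representation shrinks the per-token work from O(K) to roughly O(number of distinct topics in the document), which is the asymptotic gap the abstract alludes to.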

Cited by 12 publications (26 citation statements)
References 25 publications
“…As shown in Table Ⅳ, the most frequent words in the 2019 symposium were 'ai', 'student', and 'K-12', in that order, with more education-related words than in the 2018 symposium ('learning', 'education', 'teacher', 'curriculum', etc.). [Flattened word-frequency table; recoverable counts: computer (19), working (18), session (16), service (15), science (15), program (14), system (14), teacher (14), data (…).] As the numbers for ecraftlearn, aiall, and k were removed during preprocessing, they are enclosed in parentheses.…”
Section: Results
confidence: 99%
“…Second, topic modeling is a text mining technique used to discover the hidden semantic structure of text. It is useful for exploring topics, or changes in topic trends over time, in large amounts of unstructured data such as social media posts and newspaper articles [14]-[16].…”
Section: Analysis Methods
confidence: 99%
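The "hidden semantic structure" the citing work describes is typically recovered by a sampler like the one SaberLDA accelerates. The sketch below is a toy collapsed Gibbs sampler for LDA in plain Python; the corpus, hyperparameters, and iteration count are illustrative assumptions, not taken from any of the cited systems.

```python
import random
from collections import defaultdict

random.seed(0)

docs = [
    "student teacher curriculum education student".split(),
    "gpu kernel memory bandwidth gpu".split(),
    "education student learning teacher".split(),
    "gpu sparse kernel topic gpu".split(),
]
K, alpha, beta = 2, 0.1, 0.01          # topics and Dirichlet priors
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# One topic assignment per token, plus the count tables Gibbs needs.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]               # document-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # topic totals
for di, d in enumerate(docs):
    for wi, w in enumerate(d):
        t = z[di][wi]
        ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

for _ in range(200):                        # Gibbs sweeps
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]                   # remove current assignment
            ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            # Full conditional p(z = k | everything else)
            weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) /
                       (nk[k] + V * beta) for k in range(K)]
            t = random.choices(range(K), weights)[0]
            z[di][wi] = t                   # record new assignment
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Most frequent words per topic after sampling.
for k in range(K):
    top = sorted(nkw[k], key=nkw[k].get, reverse=True)[:3]
    print(k, top)
```

The inner loop over K in the full conditional is exactly the per-token cost that dense GPU implementations pay; sparsity-aware samplers restrict it to topics with nonzero counts.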
“…Due to the low system overhead, the throughput can be very high [38], but again they cannot handle large B. This category also includes some recent GPU-based systems such as SaberLDA [16] and BIDMach [40].…”
Section: Scalable Systems For Flat Models
confidence: 99%
“…For example, online advertisement systems extract topics from billions of search queries [34], and recommendation systems [1] need to handle millions of users and items. Various efforts have been made to develop scalable topic modeling systems, including asynchronous distributed data-parallel training [1,17], hybrid data-and-model-parallel training [37,36], embarrassingly parallel BSP training [10,38,39], and GPU-accelerated training [40,16]. These topic modeling systems mainly handle partitioning the data and model, and synchronizing the count matrix across machines.…”
Section: Introduction
confidence: 99%
“…• Expectation-Maximization (EM) techniques [10] are also applicable; they converge to a maximum a posteriori (MAP) approximation of the posterior.…”
Section: Introduction
confidence: 99%
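The MAP-convergence property mentioned above can be checked on a toy model. The sketch below runs EM on a two-component mixture of unigrams with a Dirichlet prior on the word distributions (a simpler stand-in for LDA; the data and hyperparameters are illustrative assumptions) and asserts the defining behavior: the log-posterior never decreases across iterations.

```python
import math
import random

random.seed(1)
V, K = 4, 2
docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3]]  # word ids
alpha = 1.5   # Dirichlet prior on topic-word distributions (illustrative)

# Random initialization of mixing weights and word distributions.
pi = [0.5, 0.5]
phi = [[random.random() + 0.5 for _ in range(V)] for _ in range(K)]
for k in range(K):
    s = sum(phi[k]); phi[k] = [p / s for p in phi[k]]

def log_posterior():
    # Dirichlet log-prior (up to a constant) plus the data log-likelihood.
    lp = sum((alpha - 1) * math.log(p) for row in phi for p in row)
    for d in docs:
        like = sum(pi[k] * math.prod(phi[k][w] for w in d) for k in range(K))
        lp += math.log(like)
    return lp

prev = -math.inf
for _ in range(50):
    # E-step: responsibility of each component for each document.
    r = []
    for d in docs:
        joint = [pi[k] * math.prod(phi[k][w] for w in d) for k in range(K)]
        s = sum(joint); r.append([j / s for j in joint])
    # M-step: MAP update adds the prior's pseudo-counts (alpha - 1).
    pi = [sum(r[i][k] for i in range(len(docs))) / len(docs) for k in range(K)]
    for k in range(K):
        counts = [alpha - 1] * V
        for i, d in enumerate(docs):
            for w in d:
                counts[w] += r[i][k]
        s = sum(counts); phi[k] = [c / s for c in counts]
    lp = log_posterior()
    assert lp >= prev - 1e-9   # EM monotonically improves the objective
    prev = lp
```

Each sweep climbs the (log) posterior, so the procedure settles at a local MAP estimate rather than producing samples from the full posterior, which is the trade-off against Gibbs-sampling approaches.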