We propose an unsupervised segmentation method based on an assumption about language data: that a point where the entropy of successive characters increases marks the location of a word boundary. A large-scale experiment was conducted using 200 MB of unsegmented training data and 1 MB of test data; a precision of 90% was attained, with recall around 80%. Moreover, we found that precision remained stable at around 90% regardless of the training data size.
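As a rough illustration of the idea only (not the paper's implementation), the sketch below estimates the branching entropy of the next character from raw text and proposes a boundary wherever that entropy rises. The context length, toy corpus, and function names are all assumptions.

```python
from collections import Counter, defaultdict
from math import log2

def train_ngram_counts(text, n=4):
    """Count successor characters for every context of length 1..n-1."""
    successors = defaultdict(Counter)
    for i in range(len(text)):
        for k in range(1, n):
            if i - k < 0:
                break
            successors[text[i - k:i]][text[i]] += 1
    return successors

def branching_entropy(successors, context):
    """Shannon entropy of the next-character distribution after `context`."""
    counts = successors.get(context)
    if not counts:
        return 0.0
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def segment(text, successors, k=3):
    """Propose a boundary wherever the branching entropy increases."""
    boundaries = []
    prev_h = branching_entropy(successors, text[:k])
    for i in range(k + 1, len(text)):
        h = branching_entropy(successors, text[i - k:i])
        if h > prev_h:  # entropy increase => candidate word boundary
            boundaries.append(i)
        prev_h = h
    return boundaries

# Toy "unsegmented" corpus; real experiments used 200 MB of text.
corpus = "thecatsatonthematthecatranthedogsatonthemat" * 50
succ = train_ngram_counts(corpus)
print(segment("thecatsatonthemat", succ))
```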
Matrix factorization has been widely adopted for recommendation, learning latent embeddings of users and items from observed user-item interaction data. However, previous methods usually assume that the learned embeddings are static or evolve homogeneously with the same diffusion rate. This does not hold in most scenarios, where users' preferences and item attributes drift heterogeneously over time. To remedy this issue, we propose a novel dynamic matrix factorization model, named Dynamic Bayesian Logistic Matrix Factorization (DBLMF), which aims to learn user and item embeddings that drift with heterogeneous diffusion rates. More specifically, DBLMF extends logistic matrix factorization to model the probability that a user interacts with an item at a given timestamp, and uses a diffusion process to connect latent embeddings over time. In addition, an efficient Bayesian inference algorithm is proposed to make DBLMF scalable to large datasets. The effectiveness of the proposed method is demonstrated by extensive experiments on real datasets, in comparison with state-of-the-art methods.
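A minimal sketch of the two ingredients named above, under heavy simplification: a logistic matrix factorization likelihood fit per timestamp, preceded by a Gaussian diffusion step whose rate differs per user and per item. The Bayesian inference of the diffusion rates described in the abstract is not reproduced here; the toy data, fixed rates, and hyperparameters are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical toy setup: binary interactions over T timestamps.
n_users, n_items, dim, T = 50, 40, 8, 5
interactions = rng.integers(0, 2, size=(T, n_users, n_items)).astype(float)

# Entity-specific diffusion rates, mirroring the "heterogeneous drift"
# idea; in DBLMF these are inferred, here they are fixed for illustration.
user_rate = rng.uniform(0.01, 0.1, size=(n_users, 1))
item_rate = rng.uniform(0.01, 0.1, size=(n_items, 1))

U = rng.normal(0, 0.1, size=(n_users, dim))
V = rng.normal(0, 0.1, size=(n_items, dim))
lr, lam = 0.05, 0.01

for t in range(T):
    # Diffusion step: embeddings drift with their own rates between timestamps.
    U = U + user_rate * rng.normal(0, 1, size=U.shape)
    V = V + item_rate * rng.normal(0, 1, size=V.shape)
    # Gradient ascent on the logistic likelihood for this timestamp.
    for _ in range(20):
        P = sigmoid(U @ V.T)        # predicted interaction probabilities
        G = interactions[t] - P     # gradient of the log-likelihood
        U += lr * (G @ V - lam * U)
        V += lr * (G.T @ U - lam * V)

print("mean predicted probability:", sigmoid(U @ V.T).mean())
```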
Latent Dirichlet allocation (LDA) is a popular topic modeling technique in academia but less so in industry, especially in large-scale applications involving search engines and online advertising systems. A main underlying reason is that the topic models used have been too small in scale to be useful; for example, some of the largest LDA models reported in the literature have up to 10^3 topics, which can hardly cover the long-tail semantic word sets. In this article, we show that the number of topics is a key factor that can significantly boost the utility of topic-modeling systems. In particular, we show that a "big" LDA model with at least 10^5 topics inferred from 10^9 search queries can achieve significant improvements on industrial search engine and online advertising systems, both of which serve hundreds of millions of users. We develop a novel distributed system called Peacock to learn big LDA models from big data. The main features of Peacock include a hierarchical distributed architecture, real-time prediction, and topic de-duplication. We empirically demonstrate that the Peacock system is capable of providing significant benefits via highly scalable LDA topic models for several industrial applications.
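For orientation only, here is a minimal single-machine LDA example using gensim on toy query data. It illustrates the modeling primitive that Peacock scales up, not Peacock's distributed architecture, real-time prediction, or topic de-duplication; the query data and topic count are illustrative.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy search queries, each tokenized into words.
queries = [
    ["cheap", "flights", "new", "york"],
    ["flights", "london", "paris"],
    ["machine", "learning", "course"],
    ["deep", "learning", "tutorial"],
    ["cheap", "hotels", "paris"],
]

dictionary = Dictionary(queries)
corpus = [dictionary.doc2bow(q) for q in queries]

# Peacock trains on the order of 10^5 topics over 10^9 queries;
# the toy data here only supports a handful.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)

for topic_id, words in lda.show_topics(num_topics=3, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])
```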
Zellig S. Harris's hypothesis, presented in his article "From phoneme to morpheme", is tested: namely, to what extent morpheme/word boundaries can be detected from changes in the complexity of phoneme sequences. We reformulate his hypothesis from a more information-theoretic viewpoint and use a corpus to test whether it holds. We found that his hypothesis holds for morphemes, with an F-score of about 80%, in both English and Chinese. However, we obtained contrary results for English and Chinese with regard to word boundaries; this reflects a difference in the nature of the two languages.
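A small sketch of how such a boundary-detection F-score can be computed, assuming predicted and gold boundaries are represented as sets of character offsets; the function name and toy example are illustrative, not the paper's evaluation code.

```python
def boundary_fscore(predicted, gold):
    """Precision, recall, and F-score over boundary positions."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                      # correctly placed boundaries
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# "thecatsat" with gold boundaries after "the" (offset 3) and "cat" (offset 6):
print(boundary_fscore(predicted=[3, 7], gold=[3, 6]))  # (0.5, 0.5, 0.5)
```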