To identify what entities are being talked about in tweets, we need to automatically link named entities that appear in tweets to structured KBs like WikiData. Existing approaches often struggle with such short, noisy texts, or their complex design and reliance on supervision make them brittle, difficult to use and maintain, and lose significance over time. Further, there is a lack of a large, linked corpus of tweets to aid researchers, along with lack of gold dataset to evaluate the accuracy of entity linking. In this paper, we introduce (1) Tweeki, an unsupervised, modular entity linking system for Twitter, (2) Tweeki-Data, a large, automatically-annotated corpus of Tweets linked to entities in WikiData, and (3) TweekiGold, a gold dataset for entity linking evaluation. Through comprehensive analysis, we show that Tweeki is comparable to the performance of recent state-of-the-art entity linkers models, the dataset is of high quality, and a use case of how the dataset can be used to improve downstream tasks in social media analysis (geolocation prediction).
By illuminating latent structures in a corpus of text, topic models are an essential tool for categorizing, summarizing, and exploring large collections of documents. Probabilistic topic models, such as latent Dirichlet allocation (LDA), describe how words in documents are generated via a set of latent distributions called topics. Recently, the Embedded Topic Model (ETM) has extended LDA to utilize the semantic information in word embeddings to derive semantically richer topics. As LDA and its extensions are unsupervised models, they aren't defined to make efficient use of a user's prior knowledge of the domain. To this end, we propose the Keyword Assisted Embedded Topic Model (KeyETM), which equips ETM with the ability to incorporate user knowledge in the form of informative topiclevel priors over the vocabulary. Using both quantitative metrics and human responses on a topic intrusion task, we demonstrate that KeyETM produces better topics than other guided, generative models in the literature 1 .
We propose a unified framework to evaluate and quantify the search time of multiple random searchers traversing independently and concurrently on complex networks. We find that the intriguing behaviors of multiple random searchers are governed by two basic principles-the logarithmic growth pattern and the harmonic law. Specifically, the logarithmic growth pattern characterizes how the search time increases with the number of targets, while the harmonic law explores how the search time of multiple random searchers varies relative to that needed by individual searchers. Numerical and theoretical results demonstrate these two universal principles established across a broad range of random search processes, including generic random walks, maximal entropy random walks, intermittent strategies, and persistent random walks. Our results reveal two fundamental principles governing the search time of multiple random searchers, which are expected to facilitate investigation of diverse dynamical processes like synchronization and spreading.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.