quanteda: An R package for the quantitative analysis of textual data.
quanteda is an R package providing a comprehensive workflow and toolkit for natural language processing tasks such as corpus management, tokenization, analysis, and visualization. It has extensive functions for applying dictionary analysis, exploring texts using keywords-in-context, computing document and feature similarities, and discovering multi-word expressions through collocation scoring. Built entirely on sparse operations, it provides highly efficient methods for compiling document-feature matrices and for manipulating them or using them in further quantitative analysis. Through extensive use of C++ and multithreading, quanteda is also considerably faster and more efficient than other R and Python packages at processing large textual data.
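The workflow described above (corpus management, tokenization, document-feature matrices, keywords-in-context) can be sketched as follows; this is a minimal illustration using the package's core functions, and the two example texts are invented for demonstration:

```r
library(quanteda)

# hypothetical example texts, not from the package
txts <- c(doc1 = "Text analysis in R is efficient with sparse matrices.",
          doc2 = "Sparse matrices make large-scale text analysis fast.")

corp  <- corpus(txts)                       # corpus management
toks  <- tokens(corp, remove_punct = TRUE)  # tokenization
dfmat <- dfm(toks)                          # sparse document-feature matrix

# keywords-in-context for the term "sparse"
kw <- kwic(toks, pattern = "sparse")

ndoc(dfmat)   # number of documents
nfeat(dfmat)  # number of features
```

The document-feature matrix `dfmat` is stored sparsely, which is what makes downstream operations (grouping, weighting, similarity computation) efficient on large corpora.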
Selection and design of individualized treatments remains a key goal in cancer therapeutics; prediction of response and tumor recurrence following a given therapy provides a basis for subsequent personalized treatment design. We demonstrate an approach towards this goal using photodynamic therapy (PDT) as the treatment modality and photoacoustic imaging (PAI) as a non-invasive monitor of response and disease recurrence in a murine model of glioblastoma (GBM). PDT is a photochemistry-based, clinically used technique that consumes oxygen to generate cytotoxic species, thus causing changes in blood oxygen saturation (StO2). We hypothesize that this change in StO2 can serve as a surrogate marker for predicting treatment efficacy and tumor recurrence. PAI is a technique that can provide a 3D atlas of tumor StO2 by measuring oxygenated and deoxygenated hemoglobin. We demonstrate that tumors responding to PDT undergo approximately an 85% change in StO2 by 24 hours post-therapy, while there is no significant change in StO2 values in the non-responding group. Furthermore, the 3D tumor StO2 maps predicted whether a tumor was likely to regrow at a later time point post-therapy. Information on the likelihood of tumor regrowth, which normally would have been available only upon actual regrowth (10-30 days post-treatment) in a xenograft tumor model, was available within 24 hours of treatment using PAI, thus making early intervention a possibility. Given the advances in, and push towards, the availability of PAI in clinical settings, the results of this study support the applicability of PAI as an important tool to guide and monitor therapies (e.g. PDT, radiation, anti-angiogenic therapy) that involve a change in StO2.
There is growing interest in quantitative analysis of large corpora among international relations (IR) scholars, but many of them find it difficult to use unsupervised machine learning models to perform analyses consistent with existing theoretical frameworks and thereby further develop the field. To solve this problem, we created a set of techniques that utilize a semisupervised model, allowing researchers to classify documents into predefined categories efficiently. We propose a dictionary-making procedure, based on a new entropy-based diagnostic tool, that avoids the inclusion of words likely to confuse the model and deteriorate its classification accuracy. In our experiments, we classify sentences of United Nations General Assembly speeches into six predefined categories using seeded Latent Dirichlet Allocation and Newsmap, both trained with a small "seed word dictionary" that we created following this procedure. The results show that, while the keyword dictionary alone can classify only 25% of sentences, Newsmap can correctly classify over 60% of them; its accuracy exceeds 70% when contextual information is taken into consideration by kernel smoothing of topic likelihoods. We argue that once seed word dictionaries are created by the international relations community, semisupervised models will become more useful than unsupervised models for theory-driven text analysis.
Many social scientists recognize that quantitative text analysis is a useful research methodology, but its application is still concentrated in documents written in European languages, especially English, and in a few sub-fields of political science, such as comparative politics and legislative studies. This seems to be due to the absence of flexible and cost-efficient methods that can be used to analyze documents in different domains and languages. Aiming to solve this problem, this paper proposes a semisupervised document scaling technique, called Latent Semantic Scaling (LSS), which can locate documents on various pre-defined dimensions. LSS achieves this by combining user-provided seed words with latent semantic analysis (word embedding). The article demonstrates its flexibility and efficiency in large-scale sentiment analysis of New York Times articles on the economy and Asahi Shimbun articles on politics. These examples show that LSS can produce results comparable to those of the Lexicoder Sentiment Dictionary (LSD) in both English and Japanese with only small sets of sentiment seed words. A new heuristic method is also presented that helps LSS users choose a near-optimal number of singular values, yielding word vectors that best capture differences between documents on the target dimensions. The concentration of quantitative text analysis in European languages is not because researchers are only interested in domestic politics in North America and Europe, but because the existing quantitative text analysis toolkit is not suitable for the analysis of documents in non-European languages or in other fields.
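The core idea behind Latent Semantic Scaling can be sketched schematically in base R: word vectors are obtained from a truncated SVD of a document-term matrix, each word receives a polarity score from its cosine similarity to user-provided seed words, and documents are scored by averaging the polarity of their words. This is an illustrative sketch of the idea, not the actual LSS implementation; the toy texts, seed words, and the choice of two singular values are all assumptions for demonstration:

```r
# toy corpus (whitespace-tokenized for simplicity)
txts <- c("good gains growth market",
          "bad losses crisis market",
          "growth gains good economy")
vocab <- unique(unlist(strsplit(txts, " ")))

# document-term count matrix (documents x terms)
dtm <- t(sapply(strsplit(txts, " "),
                function(w) table(factor(w, levels = vocab))))

k  <- 2                                  # number of singular values kept
sv <- svd(dtm, nu = k, nv = k)
wordvec <- sv$v %*% diag(sv$d[1:k])      # word embeddings (terms x k)
rownames(wordvec) <- vocab

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# hypothetical sentiment seed words
seeds_pos <- c("good")
seeds_neg <- c("bad")

# word polarity: similarity to positive seeds minus similarity to negative seeds
polarity <- apply(wordvec, 1, function(v)
  mean(sapply(seeds_pos, function(s) cosine(v, wordvec[s, ]))) -
  mean(sapply(seeds_neg, function(s) cosine(v, wordvec[s, ]))))

# document score: average polarity of its tokens
doc_score <- sapply(strsplit(txts, " "), function(w) mean(polarity[w]))
doc_score
```

The heuristic mentioned in the abstract concerns the choice of `k`: too few singular values lose the target dimension, too many reintroduce noise, so the number is tuned to best separate documents along the seeded dimension.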
The domination of European languages in quantitative text analysis is partially due to its history: many of the text analysis dictionaries, including the Lexicoder Sentiment Dictionary (LSD) (Young & Soroka, 2012) and the Linguistic Inquiry and Word Count (LIWC), were created based on the General Inquirer dictionaries (Stone et al., 1966), which were developed to analyze English texts during the Cold War; statistical analysis of textual data was also introduced to political science.