Machine learning has become the predominant problem-solving strategy for computational linguistics problems in the last decade. Many researchers work on improving algorithms, developing new ones, and investigating issues of feature representation, among other topics. Other researchers, however, apply machine-learning techniques as off-the-shelf implementations, often with little knowledge of the algorithms or the intricacies of data representation. In this book, Daelemans and van den Bosch provide an in-depth introduction to Memory-Based Language Processing (MBLP) that shows, for a range of NLP problems, how the technique is successfully applied. Apart from these more practical issues, the book also explores the suitability of the chosen learning paradigm, memory-based learning (Stanfill and Waltz 1986), for NLP problems. The book is thus a valuable source of information for a wide range of readers, from the linguist interested in applying machine-learning techniques or the machine-learning specialist with no prior experience in NLP, to the expert in machine learning who wants to learn more about the appropriateness of the MBLP bias for NLP problems.

Memory-based learning is a machine-learning method based on the idea that examples can be reused directly in processing natural language problems. Training examples are stored without modification or abstraction. During classification, the most similar examples from the training data are located, and their class is used to classify the new example.

The book addresses different levels of understanding and working with MBLP. On one level, it explains the theoretical concepts of memory-based learning; on another, it provides more practical information: the implementation of memory-based learning, TiMBL, is described, as well as extensions such as FamBL and MBT. On yet another level, the application of these techniques is described for typical problems in natural language processing. The reader learns how to model standard classification problems such as POS tagging, as well as sequence-learning problems, which are more difficult to model as classification problems. Daelemans and van den Bosch also cover critical issues, such as problems that arise in the evaluation of such experiments and the automation of the search for suitable system parameter settings. On a more abstract level, they approach the question of how suitable the bias of MBLP is. In chapter 6, they compare memory-based learning, an instance of lazy learning, with rule induction, an instance of eager learning, with regard to how classification accuracy changes when, for example, more abstraction is introduced. Since MBLP does not abstract over the training data, it is called a lazy learning approach. Rule induction, in contrast, learns rules and does not go back to the actual training data during classification.
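The classification procedure sketched in this review amounts to k-nearest-neighbour classification over verbatim-stored training instances. The following minimal sketch illustrates that idea; it is not TiMBL's implementation, and the overlap distance, the k=1 default, and the toy PP-attachment instances are illustrative assumptions.

```python
# A minimal sketch of the memory-based classification idea described above:
# store all training instances without abstraction and classify a new instance
# by the majority class of its most similar stored examples.
# Illustrative only; not TiMBL's actual implementation.

from collections import Counter

def overlap_distance(a, b):
    """Count the feature positions on which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def mbl_classify(memory, instance, k=1):
    """Classify `instance` by majority vote among the k nearest stored examples.

    `memory` is a list of (feature_tuple, class_label) pairs kept verbatim.
    """
    neighbours = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy usage: PP-attachment-style instances (verb, noun1, prep, noun2) -> attachment class
memory = [
    (("eat", "pizza", "with", "fork"), "V"),
    (("eat", "pizza", "with", "anchovies"), "N"),
    (("see", "man", "with", "telescope"), "V"),
]
print(mbl_classify(memory, ("eat", "pasta", "with", "spoon"), k=1))  # -> "V"
```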
While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages, and the information overload on the Web requires intelligent systems to identify potential risks automatically. The focus of this paper is on automatic cyberbullying detection in social media text by modelling posts written by bullies, victims, and bystanders of online bullying. We describe the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and perform a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection. We make use of linear support vector machines exploiting a rich feature set and investigate which information sources contribute most to the task. Experiments on a hold-out test set reveal promising results for the detection of cyberbullying-related posts. After optimisation of the hyperparameters, the classifier yields F1 scores of 64% and 61% for English and Dutch, respectively, and considerably outperforms baseline systems.
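The abstract describes linear support vector machines over a rich feature set with hyperparameter optimisation, but not the exact feature groups or settings. The sketch below, using scikit-learn with word and character n-grams and a tuned regularisation constant, is one plausible instantiation of such a setup; all names and values are assumptions, not the paper's configuration.

```python
# A hedged sketch of a linear SVM over a combined ("rich") feature set,
# here word and character n-grams. Feature groups, preprocessing, and the
# hyperparameter grid are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ])),
    ("clf", LinearSVC()),
])

# Hyperparameter optimisation, e.g. tuning the SVM's regularisation constant C
# by cross-validated F1, before evaluating once on a hold-out test set.
grid = GridSearchCV(pipeline, {"clf__C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=5)

# posts, labels = ...  # annotated posts and binary bullying-related labels
# grid.fit(posts, labels)
# predictions = grid.predict(heldout_posts)
```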
Finding negation signals and their scope in text is an important subtask in information extraction. In this paper, we present a machine learning system that finds the scope of negation in biomedical texts. The system combines several classifiers and works in two phases. To investigate the robustness of the approach, the system is tested on the three subcorpora of the BioScope corpus, which represent different text types. It achieves the best results to date for this task, with an error reduction of 32.07% compared to current state-of-the-art results.
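The system is described only at the level of its two-phase architecture combining several classifiers. The placeholder sketch below shows one way such a structure could be organised, detecting negation cues first and then voting over candidate scope tokens per cue; it is not the paper's actual design, and both functions and their signatures are hypothetical.

```python
# A schematic sketch of a two-phase negation-scope pipeline: phase 1 detects
# negation signals (cues), phase 2 decides, per token, whether it falls inside
# the scope of each detected cue. Everything here is an illustrative placeholder.

def detect_negation_cues(tokens, cue_classifier):
    """Phase 1: return the indices of tokens flagged as negation signals."""
    return [i for i, tok in enumerate(tokens) if cue_classifier(tok)]

def predict_scope(tokens, cue_index, scope_classifiers):
    """Phase 2: for one cue, collect the tokens judged to be inside its scope.

    Several scope classifiers cast boolean votes per token; the majority decision
    is kept, loosely mimicking the idea of combining classifiers.
    """
    scope = []
    for i, _ in enumerate(tokens):
        votes = [clf(tokens, cue_index, i) for clf in scope_classifiers]
        if sum(votes) > len(votes) / 2:
            scope.append(i)
    return scope
```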
A common characteristic of communication on online social networks is that it happens via short messages, often using nonstandard language variations. These characteristics make this type of text a challenging genre for natural language processing. Moreover, in these digital communities it is easy to provide a false name, age, gender, and location in order to hide one's true identity, providing criminals such as pedophiles with new possibilities to groom their victims. It would therefore be useful if user profiles could be checked on the basis of text analysis, and false profiles flagged for monitoring. This paper presents an exploratory study in which we apply a text categorization approach to the prediction of age and gender on a corpus of chat texts, which we collected from the Belgian social networking site Netlog. We examine which types of features are most informative for a reliable prediction of age and gender on this difficult text type, and we perform experiments with different data set sizes in order to acquire more insight into the minimum data size requirements for this task.
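The abstract mentions experiments with different data set sizes to probe minimum data requirements. The sketch below illustrates such a learning-curve experiment with a generic text categorization pipeline; the character n-gram features, the logistic-regression stand-in, and the sizes are assumptions rather than the paper's setup.

```python
# An illustrative learning-curve experiment: train a text classifier on
# increasing amounts of chat data and track cross-validated accuracy, to probe
# minimum data size requirements. Features, model, and sizes are assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def learning_curve_scores(texts, labels, sizes):
    """Return mean cross-validated accuracy for each training-set size."""
    scores = {}
    for n in sizes:
        model = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
            LogisticRegression(max_iter=1000),
        )
        scores[n] = cross_val_score(model, texts[:n], labels[:n], cv=5).mean()
    return scores

# chat_texts, gender_labels = ...  # chat posts with gender (or age group) labels
# print(learning_curve_scores(chat_texts, gender_labels, sizes=[1000, 5000, 10000]))
```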
We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms.
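The editing experiments rely on scoring training instances by typicality and class prediction strength and removing the lowest-scoring ones. The simplified sketch below approximates class prediction strength with a leave-one-out nearest-neighbour check; the measures used in the paper are more refined, so this only illustrates the general editing procedure, and the function names are hypothetical.

```python
# A simplified stand-in for instance-base editing: estimate each training
# instance's "class prediction strength" with a leave-one-out nearest-neighbour
# check, then drop the weakest (most exceptional) instances. The paper's actual
# typicality and CPS measures are more refined than this illustration.

def overlap(a, b):
    """Count the feature positions on which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def prediction_strength(index, memory):
    """Return 1.0 if the instance's nearest other neighbour shares its class, else 0.0."""
    feats, label = memory[index]
    others = [ex for j, ex in enumerate(memory) if j != index]
    nearest = min(others, key=lambda ex: overlap(ex[0], feats))
    return 1.0 if nearest[1] == label else 0.0

def edit_instance_base(memory, threshold=1.0):
    """Keep only instances whose estimated prediction strength reaches the threshold."""
    return [ex for i, ex in enumerate(memory)
            if prediction_strength(i, memory) >= threshold]
```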