In supervised machine learning for author name disambiguation, negative training data often far outnumber positive training data. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-to-negative training data ratios. Results show that increasing negative training data can improve disambiguation performance, but only by a few percent, and can sometimes degrade it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive to negative training data. The performance improvement of Random Forest tends to saturate quickly, roughly after 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on a subset of the negative training data without much loss in disambiguation performance while gaining computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
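The idea of training at a fixed positive-to-negative ratio can be illustrated with a minimal sketch of random negative undersampling. This is not the paper's code: the function name `subsample_negatives`, the 0/1 label encoding (1 for a matching pair of records, 0 for a non-matching pair), and the toy label counts are all assumptions made for illustration.

```python
import random


def subsample_negatives(labels, neg_per_pos, seed=0):
    """Return sorted indices that keep every positive example and at most
    `neg_per_pos` randomly chosen negatives per positive (a 1:neg_per_pos ratio).

    Hypothetical helper: labels are assumed to be 1 (positive) or 0 (negative).
    """
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than actually exist.
    n_keep = min(len(neg_idx), neg_per_pos * len(pos_idx))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept_neg = rng.sample(neg_idx, n_keep)
    return sorted(pos_idx + kept_neg)


# Toy data (assumed): 3 positive pairs and 60 negative pairs.
labels = [1] * 3 + [0] * 60

for k in (1, 10, 15):
    idx = subsample_negatives(labels, k)
    n_pos = sum(labels[i] for i in idx)
    print(f"ratio 1:{k} -> {n_pos} positives, {len(idx) - n_pos} negatives")
```

The returned indices can then be used to select the matching rows of a feature matrix before fitting any classifier, so the same pipeline can be rerun at each ratio and the results compared.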
The potential of things or objects to generate and process data about the day-to-day activities of their users has given the Internet of Things (IoT) a new level of popularity among consumers. Even though this popularity has increased steadily, adoption of IoT devices has been slow and abandonment rapid. To build on the existing literature and advance our understanding of the sociological processes behind use and non-use of these devices, this paper presents results from a survey of 489 IoT users. Our qualitative analysis of open-ended questions revealed that the motives for use include the multi-functionality of devices that provide control over daily activities, a social competitive edge, economic advantage, and habit. The justifications for limiting or stopping use include privacy concerns, information overload and inaccuracy, demotivation caused by reminders about pending or failed goals, loss of excitement once initial curiosity is satisfied, and maintenance becoming unmanageable in terms of effort, time, and money.