This article considers the identification conditions of confirmatory factor analysis (CFA) models for ordered categorical outcomes with invariance of different types of parameters across groups. The current practice of invariance testing is to first identify a model with only configural invariance and then test the invariance of parameters based on this identified baseline model. This approach is not optimal because different identification conditions on this baseline model identify the scales of latent continuous responses in different ways. Once an invariance condition is imposed on a parameter, these identification conditions may become restrictions and define statistically non-equivalent models, leading to different conclusions. By analyzing the transformation that leaves the model-implied probabilities of response patterns unchanged, we give identification conditions for models with invariance of different types of parameters without referring to a specific parametrization of the baseline model. Tests based on this approach have the advantage that they do not depend on the specific identification condition chosen for the baseline model.
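To make the scale indeterminacy at issue concrete, here is a generic sketch in standard ordinal-CFA notation (illustrative only; the symbols and parametrization are assumptions, not the article's own): the model-implied category probabilities for item j in group g are unchanged by a positive affine rescaling of the latent continuous response, provided thresholds, intercepts, loadings, and residual scales are rescaled along with it.

```latex
% Generic ordinal-CFA sketch (illustrative notation, not the article's own)
\begin{aligned}
  y^*_{jg} &= \nu_{jg} + \lambda_{jg}\,\eta_g + \varepsilon_{jg}, \qquad
  y_{jg} = c \iff \tau_{j,c-1,g} < y^*_{jg} \le \tau_{j,c,g},\\[4pt]
  \tilde{y}^*_{jg} &= a_{jg}\,y^*_{jg} + b_{jg}, \quad a_{jg} > 0, \qquad
  \tilde{\tau}_{j,c,g} = a_{jg}\,\tau_{j,c,g} + b_{jg},\\[4pt]
  \Pr(y_{jg} = c) &= \Pr\!\left(\tau_{j,c-1,g} < y^*_{jg} \le \tau_{j,c,g}\right)
                  = \Pr\!\left(\tilde{\tau}_{j,c-1,g} < \tilde{y}^*_{jg} \le \tilde{\tau}_{j,c,g}\right).
\end{aligned}
```

An identification condition fixes the free scale and location (a_{jg}, b_{jg}) in each group; once invariance constraints are imposed across groups, different ways of fixing them are no longer guaranteed to be mere reparametrizations, which is the problem the article addresses.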
Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means of gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items (their difficulty and discriminating power) and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that the fitted model allows NLP systems to be compared against the performance of a human population and provides more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, because the IRT score also depends on the item characteristics and the response pattern.
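As a reference point for the item characteristics mentioned above, the Python sketch below (illustrative only; the item parameters and responses are made up, and the paper's models were fit to large sets of human responses) scores a single respondent under a two-parameter logistic (2PL) model, in which each item carries a difficulty and a discrimination parameter and ability is estimated by maximum likelihood:

```python
# Minimal 2PL IRT sketch (illustrative; not the paper's exact model or code).
# Assumes item parameters (a: discrimination, b: difficulty) are known and
# estimates one respondent's ability theta by maximum likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(theta, a, b):
    """2PL item response function: P(correct | theta, a, b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def neg_log_likelihood(theta, responses, a, b):
    p = np.clip(p_correct(theta, a, b), 1e-9, 1 - 1e-9)  # numerical safety
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

# Hypothetical parameters for 5 items and one respondent's response pattern.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.6])   # discrimination
b = np.array([-1.0, 0.0, 0.5, 1.2, 2.0])  # difficulty
responses = np.array([1, 1, 1, 0, 0])     # 1 = correct, 0 = incorrect

result = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded",
                         args=(responses, a, b))
print(f"Estimated ability theta = {result.x:.2f}")
```

Under such a model, which items are answered correctly matters, not just how many, which is why a high accuracy score need not translate into a high estimated ability.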
Structural equation models (SEM) are widely used for modeling complex multivariate relationships among measured and latent variables. Although several analytical approaches to interval estimation in SEM have been developed, a comprehensive review of these methods is lacking. We review the popular Wald-type and lesser-known likelihood-based methods in linear SEM, emphasizing profile likelihood-based confidence intervals (CIs). Existing algorithms for computing profile likelihood-based CIs are described, including two newer algorithms that are extended to construct profile likelihood-based confidence regions (CRs). Finally, we illustrate the use of these CIs and CRs with two empirical examples and provide practical recommendations on when to use Wald-type CIs and CRs versus profile likelihood-based CIs and CRs. OpenMx example code for constructing profile likelihood-based CIs and CRs for SEM is provided in an Online Appendix.
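For orientation, the two interval types being compared can be written down generically (standard definitions, not quoted from the article): a Wald-type CI is symmetric around the point estimate, whereas a profile likelihood-based CI collects the parameter values not rejected by a likelihood-ratio test after profiling out the remaining parameters.

```latex
% Standard definitions (generic notation, not the article's own)
\begin{aligned}
  \text{Wald: } & \hat\theta \;\pm\; z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\theta),\\[4pt]
  \text{Profile likelihood: } & \ell_p(\theta) = \max_{\psi}\,\ell(\theta,\psi), \qquad
  \mathrm{CI}_{1-\alpha} = \Bigl\{\theta_0 :\;
    2\bigl[\ell(\hat\theta,\hat\psi) - \ell_p(\theta_0)\bigr] \le \chi^2_{1,\,1-\alpha}\Bigr\}.
\end{aligned}
```

The Wald interval rests on a quadratic (symmetric) approximation to the log-likelihood around the estimate, while the profile interval follows the likelihood's actual shape, which is the usual reason the two can disagree in small samples or near parameter bounds.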
Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable information about model performance and behavior. Traditionally, IRT models are learned using human response pattern (RP) data, presenting a significant bottleneck for large data sets like those required for training deep neural networks (DNNs). In this work we propose learning IRT models using RPs generated from artificial crowds of DNN models. We demonstrate the effectiveness of learning IRT models using DNN-generated data through quantitative and qualitative analyses for two NLP tasks. Parameters learned from human and machine RPs for natural language inference and sentiment analysis exhibit medium to large positive correlations. We demonstrate a use-case for latent difficulty item parameters, namely training set filtering, and show that using difficulty to sample training data outperforms baseline methods. Finally, we highlight cases where human expectation about item difficulty does not match difficulty as estimated from the machine RPs.
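To make the training-set-filtering use case concrete, here is a small Python sketch (the function name, the softmax-style weighting, and the placeholder difficulty values are all assumptions for illustration; the paper's exact sampling procedure may differ) that draws a training subset with selection probability increasing in estimated item difficulty:

```python
# Sketch: sample a training subset using IRT difficulty estimates.
# Assumes `difficulty` holds one latent difficulty value per training item,
# e.g. estimated from an IRT model fit to machine-generated response patterns.
# Illustrative scheme only, not the paper's exact filtering procedure.
import numpy as np

rng = np.random.default_rng(0)

n_items = 10_000
difficulty = rng.normal(loc=0.0, scale=1.0, size=n_items)  # placeholder estimates

def sample_by_difficulty(difficulty, k, temperature=1.0, rng=rng):
    """Draw k distinct item indices with probability increasing in difficulty."""
    logits = difficulty / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(difficulty), size=k, replace=False, p=probs)

subset_idx = sample_by_difficulty(difficulty, k=2_000)
print(f"Selected {len(subset_idx)} items; "
      f"mean difficulty {difficulty[subset_idx].mean():.2f} "
      f"vs corpus mean {difficulty.mean():.2f}")
```

In practice, the difficulty vector would come from an IRT model fit to response patterns generated by an artificial crowd of DNN models, as described above.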