The probabilistic interpretation of Canonical Correlation Analysis (CCA) for learning low-dimensional real vectors, called latent variables, has been widely exploited in various fields. This study goes a step further by demonstrating the potential of CCA to discover a latent state that captures the contextual information within textual data under a two-view setting. The interpretation of CCA discussed in this study exploits the multi-view nature of textual data, i.e., consecutive sentences in a document or turns in a dyadic conversation, and has a strong theoretical foundation. Furthermore, this study proposes a model that uses CCA to perform the Automatic Short Answer Grading (ASAG) task. The empirical analysis confirms that the proposed model delivers competitive results and can even beat various sophisticated supervised techniques. The model is simple, linear, and adaptable, and should be used as a baseline, especially when labeled training data is scarce or nonexistent.
Semantic Textual Similarity (STS) is an NLP task that compares the two sentences in a sentence pair and scores their relationship by degree of semantic equivalence. It has wide applicability across fields, so research on the task is constantly evolving and the demand for new and improved methods persists. Numerous methods have been proposed, most of which belong to either unsupervised or supervised learning approaches. The model proposed here is fairly simple and provides a fresh take on this classification problem using spectral learning. Unlike most supervised and unsupervised STS methods, it does not rely on a large labeled corpus or a lexical database. Although supervised STS methods achieve accuracy that outperforms humans in some cases, they are often held back by the difficulty of interpreting the features that drive their decisions. The proposed model, on the other hand, generates features (latent knowledge) that are easy to ascertain and have a mathematical foundation. Given a sentence pair, the work focuses on finding latent states and variables from each sentence and performs classification by generating a similarity score. The latent variables result from projections learned by performing Canonical Correlation Analysis (CCA) on the sentence pair. To perform matching and determine the similarity score, cosine similarity and Word Mover’s Distance (WMD) are employed. The proposed model improves on various sophisticated supervised techniques such as LSTM and BiLSTM.
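The projection-and-matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each sentence is already represented as a fixed-length feature vector (for example, an averaged word embedding), uses synthetic data in place of real sentence pairs, and scores only with cosine similarity, leaving WMD out. The function names `fit_cca` and `cosine_rows`, the regularization constant `reg`, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np


def fit_cca(X, Y, k, reg=1e-3):
    """Fit regularized linear CCA on paired views X (n, dx) and Y (n, dy).

    Returns the view means, the projection matrices A and B, and the top-k
    canonical correlations. `reg` is an assumed ridge term for stability.
    """
    n = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return mx, my, A, B, s[:k]


def cosine_rows(P, Q):
    """Row-wise cosine similarity between two equally shaped matrices."""
    num = np.sum(P * Q, axis=1)
    den = np.linalg.norm(P, axis=1) * np.linalg.norm(Q, axis=1)
    return num / den


# Synthetic stand-in for sentence-pair features: both views share a latent state z.
rng = np.random.default_rng(0)
n, dx, dy, k = 500, 20, 15, 3
z = rng.standard_normal((n, k))
X = z @ rng.standard_normal((k, dx)) + 0.1 * rng.standard_normal((n, dx))
Y = z @ rng.standard_normal((k, dy)) + 0.1 * rng.standard_normal((n, dy))

mx, my, A, B, corrs = fit_cca(X, Y, k)

# Project each view into the shared latent space and score the pairs.
P = (X - mx) @ A
Q = (Y - my) @ B
paired = cosine_rows(P, Q)                          # true pairs
mismatched = cosine_rows(P, np.roll(Q, 1, axis=0))  # deliberately misaligned pairs

print("canonical correlations:", corrs)
print("mean cosine, true pairs:", paired.mean())
print("mean cosine, mismatched pairs:", mismatched.mean())
```

In this toy setting the true pairs score well above the misaligned ones, which is the behavior a similarity-based classifier would threshold on. With real text, the feature extraction step and the choice of `k` would need to be tuned to the data.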