Multivariate regression, which can also be posed as a multitask machine learning problem, is used to better understand multiple outputs based on a given set of inputs. Many methods have been proposed to utilize information shared across responses, with applications in fields such as economics, genomics, advanced manufacturing, and precision medicine. Interest in these areas, coupled with the rise of large data sets (“big data”), has generated demand both for more efficient computation and for methods that account for the heterogeneity that may exist between responses. One way to exploit this heterogeneity is to use methods that detect groups, also called clusters, of related responses. These methods provide a framework that can increase computational speed and account for the complexity of the relationships among a large number of responses. With this flexibility come additional challenges, such as how to identify these clusters of responses, how to perform model selection, and how to develop more complex algorithms that combine concepts from both the supervised and unsupervised learning literature. We explore current state-of-the-art methods, present a framework to better understand methods that utilize or detect clusters of responses, and provide insights on the computational challenges associated with this framework. Specifically, we present a simulation study that discusses the challenges of model selection when detecting clusters of responses of interest. We also comment on extensions and open problems that are of interest to both the research and practitioner communities.
This article is categorized under:
Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification
Statistical Learning and Exploratory Methods of the Data Sciences > Exploratory Data Analysis
Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods