Clinicians often study complex traits in which many variables may be involved in the process underlying the disease. With the advent of DNA sequencing techniques, clinical studies now regularly involve high-dimensional data, in which the number of variables far exceeds the number of samples. Typical research goals are to predict an outcome or to select variables, using the high-dimensional data, possibly in addition to clinical variables such as age and sex. Examples in cancer genomics are diagnosing cancer, classifying cancer type and predicting survival time from gene expression data, or finding a small set of genes that predicts these outcomes well. While the human genome contains around 20,000 genes, clinical studies usually include measurements for only around 100 patients. This limited amount of information makes it hard to find the “right” selection or combination of variables in the vast space of options.
Fortunately, prior information on the variables is often available in the form of complementary data, or co-data, for example from public repositories or derived from domain knowledge. Co-data may vary in type. Genes, for example, may be grouped into non-overlapping groups by chromosome, overlapping groups by pathway or hierarchical groups by gene ontology, or they may be assigned a continuous summary statistic derived from a similar study. We would like to learn from co-data to improve prediction and variable selection.
This dissertation presents three statistical methods and accompanying software for co-data learning. For the outcome, we consider penalised generalised linear models and Cox survival models in which the penalties are learnt from co-data. The penalties on the variables are informed by the co-data, such that variables that the co-data indicate to be more important are penalised less. For example, some biological functions may be more important than others, so that genes corresponding to these functions are ideally penalised less. The penalty parameters are related to the parameters of a prior distribution on the variables' regression coefficients, and these prior parameters are estimated with an empirical Bayes approach. The presented methods differ in the type of co-data and in the penalty or prior that may be used.
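As an illustration of how a penalty can relate to a prior, consider the following minimal sketch. It assumes a ridge penalty and co-data that partition the variables into non-overlapping groups; both are assumptions made here for concreteness and need not match any one of the three methods. The symbols $\ell$ (the log-likelihood, or the partial log-likelihood for the Cox model), $\mathcal{G}_g$ (the set of variables in co-data group $g$), $\lambda_g$ (the group-specific penalty) and $\tau_g^2$ (the group-specific prior variance) are notation introduced for this example only.

```latex
% Illustrative sketch: group-specific ridge penalties (assumed setting)
\hat{\beta}
  = \arg\max_{\beta}
    \left\{ \ell(\beta; y, X)
      - \frac{1}{2} \sum_{g=1}^{G} \lambda_g
        \sum_{j \in \mathcal{G}_g} \beta_j^2 \right\}

% Corresponding prior: independent zero-mean Gaussians per group
\beta_j \sim N\!\left(0, \tau_g^2\right),
\qquad \tau_g^2 = \frac{1}{\lambda_g},
\qquad j \in \mathcal{G}_g
```

Under this assumed prior, the log-posterior equals the penalised log-likelihood up to a constant, so the penalised estimate is the posterior mode. A group with a larger prior variance $\tau_g^2$ therefore receives a smaller penalty $\lambda_g$, and estimating the $\tau_g^2$ from the data corresponds to learning the penalties from the co-data.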