Sequence-based association studies are at a critical inflexion point with
the increasing availability of exome-sequencing data. A popular test of
association is the sequence kernel association test (SKAT). Weights are embedded
within SKAT to reflect the hypothesized contribution of the variants to the
trait variance. Because the true weights are generally unknown, and therefore
subject to misspecification, we examined the efficiency of a data-driven
weighting scheme.
We propose the use of a set of theoretically defensible weighting
schemes, and assume that the scheme giving the largest test statistic is the one
most likely to capture the relationship between allele frequency and functional
effect. We
show that the use of alternative weights obviates the need to impose arbitrary
frequency thresholds in sequence data association analyses. As both the score
test and the likelihood ratio test (LRT) may be used in this context, and may
differ in power, we characterize the behavior of both tests.
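
The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the candidate schemes, their names, and their Beta shape parameters are hypothetical; the statistic is the linear-kernel SKAT score form Q = sum_j (w_j S_j)^2 with S_j = sum_i g_ij r_i; and each weight vector is rescaled to unit length so that statistics are comparable across schemes, which is an assumption of this sketch only. Assessing the significance of the maximized statistic is outside its scope.

```python
import math


def beta_density(x, a, b):
    """Beta(a, b) probability density at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


# Hypothetical candidate schemes: name -> Beta shape parameters (a, b).
# These names and values are illustrative assumptions, not taken from the study.
CANDIDATE_SCHEMES = {
    "flat": (1.0, 1.0),         # every variant weighted equally
    "moderate": (0.5, 0.5),     # mild up-weighting of rarer variants
    "rare_heavy": (1.0, 25.0),  # strong up-weighting of rare variants
}


def normalized_weights(mafs, a, b):
    """Beta-density weights by minor allele frequency, rescaled to unit length.

    The rescaling is an assumption of this sketch, made so that the
    statistics from different schemes are on a common scale.
    """
    raw = [beta_density(m, a, b) for m in mafs]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]


def skat_statistic(genotypes, residuals, weights):
    """Q = sum_j (w_j * S_j)^2, where S_j = sum_i g_ij * r_i."""
    q = 0.0
    for j, w in enumerate(weights):
        score_j = sum(row[j] * r for row, r in zip(genotypes, residuals))
        q += (w * score_j) ** 2
    return q


def best_scheme(genotypes, residuals, mafs, schemes=CANDIDATE_SCHEMES):
    """Return the scheme with the largest statistic, plus all statistics."""
    stats = {
        name: skat_statistic(genotypes, residuals, normalized_weights(mafs, a, b))
        for name, (a, b) in schemes.items()
    }
    return max(stats, key=stats.get), stats


# Toy data: 4 individuals x 2 variants; residuals come from a null model fit.
genotypes = [[1, 1], [0, 1], [0, 0], [0, 0]]
residuals = [1.0, -1.0, 0.5, -0.5]
mafs = [0.01, 0.3]
best, stats = best_scheme(genotypes, residuals, mafs)
```

In this toy example only the rare variant carries signal, so the scheme that concentrates weight on rare variants yields the largest statistic.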
We found that the two tests have equal power if the set of weights
resembles the correct ones. However, if the weights are badly specified, the LRT
shows superior power (due to its robustness to misspecification). With this
data-driven weighting procedure, the LRT detected significant signal in genes
located in regions already confirmed as associated with schizophrenia,
PRRC2A (P = 1.020E-06) and VARS2 (P = 2.383E-06), in the Swedish
schizophrenia case-control cohort of 11,040 individuals with exome-sequencing
data.
The score test is currently preferred for its computational efficiency
and power. Indeed, assuming correct specification, in some circumstances the
score test is the most powerful. However, the LRT has the advantage of
being generally more robust and more powerful under weight misspecification.
This is an important result given that, arguably, misspecified models are likely
to be the rule rather than the exception in weighting-based approaches.