Over 700,000 people a year. That is how many people die by suicide worldwide. It is therefore incredibly important to implement effective prevention interventions. For most of these interventions, however, it is crucial to know which subgroups of the population to target.
The first part of this thesis focuses entirely on this problem and approaches it through the lens of big data. Using data from Statistics Netherlands, we start by looking at demographic data and asking whether we can identify high-risk groups by their demographic features. We find many such groups, including men, people of middle age, people on benefits, and people living alone. We then consider whether there are intersections of these populations that are at higher risk than would be expected if the risk factors acted independently. Again we find multiple unexpected groups, such as widowers and people between the ages of 25 and 40 with a low level of education. Finally, we look at medication usage and find that many classes of medication are associated with a heightened risk of suicide.
The second part focuses on the theoretical questions that arose from the first part: how do you decide which features to include, and is it possible to quantify dependence between observed variables? We start by designing a measure of dependence, which answers the second of these questions. We show that it has a number of basic properties one would expect of such a measure, and that none of the commonly used measures has all of them. We then extend it to a measure of feature importance by considering how much a feature contributes to the dependency of the outcome on "coalitions" of features. Finally, we examine 20 basic properties and show that our notion of feature importance satisfies all of them, whereas most other feature importance methods satisfy fewer than half, and none satisfies more than 13.