Knowledge graph embedding research has overlooked the problem of probability calibration. We show that popular embedding models are indeed uncalibrated, meaning the probability estimates associated with predicted triples are unreliable. We present a novel method to calibrate a model when ground truth negatives are not available, which is the usual case in knowledge graphs. We propose to combine our method with Platt scaling and isotonic regression. Experiments on three datasets with ground truth negatives show that our contribution leads to well-calibrated models when compared against the gold standard of using negatives. All calibration methods yield significantly better results than the uncalibrated models. We show that isotonic regression offers the best performance overall, albeit not without trade-offs. We also show that calibrated models reach state-of-the-art accuracy without the need to define relation-specific decision thresholds.
INTRODUCTION

Knowledge graph embedding models are neural architectures that learn vector representations (i.e. embeddings) of the nodes and edges of a knowledge graph. Such knowledge graph embeddings have applications in knowledge graph completion, knowledge discovery, entity resolution, and link-based clustering, to name a few (Nickel et al., 2016a).

Despite burgeoning research, the problem of calibrating such models has been overlooked, and existing knowledge graph embedding models do not offer any guarantee on the probability estimates they assign to predicted facts. Probability calibration is important whenever a model's predictions must make probabilistic sense: if the model predicts a fact is true with 80% confidence, it should be correct 80% of the time. Prior art suggests using a sigmoid layer (also called the expit transform) to turn the logits returned by models into probabilities (Nickel et al., 2016a), but we show that this provides poor calibration. Figure 1 shows reliability diagrams for off-the-shelf TransE and ComplEx. The identity function represents perfect calibration. Both models are miscalibrated: all TransE combinations in Figure 1a under-forecast the probabilities (i.e. probabilities are too small), whereas ComplEx under-forecasts or over-forecasts depending on which loss is used (Figure 1b).

Calibration is crucial in high-stakes scenarios such as drug-target discovery from biological networks, where end-users need trustworthy and interpretable decisions. Moreover, since probabilities are not calibrated, when classifying triples (i.e. facts) as true or false, users must define relation-specific decision thresholds, which can be awkward for graphs with a large number of relation types.
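For illustration, the sketch below (not part of this work; all data, scores, and variable names are synthetic and purely illustrative) shows how logits can be turned into probabilities with the expit transform, how reliability-diagram statistics can be computed, and how Platt scaling and isotonic regression act as post-hoc calibrators, using NumPy, SciPy, and scikit-learn. Unlike the setting targeted by our method, this sketch assumes ground truth negatives are available.

```python
# A minimal sketch of the expit transform, reliability diagrams, and
# two post-hoc calibration techniques. Illustrative only.
import numpy as np
from scipy.special import expit
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for raw scores (logits) a knowledge graph
# embedding model might assign to candidate triples, with ground
# truth labels (1 = true triple, 0 = false triple).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)
scores = rng.normal(loc=2.0 * labels - 1.0, scale=2.0)

# Expit (sigmoid) transform: the common, but poorly calibrated,
# way of turning logits into "probabilities".
probs_uncal = expit(scores)

# Platt scaling: fit a logistic regression on the raw scores
# (in practice, on a held-out labeled set).
platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
probs_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: fit a monotonic, piecewise-constant mapping
# from uncalibrated probabilities to calibrated ones.
iso = IsotonicRegression(out_of_bounds="clip")
probs_iso = iso.fit_transform(probs_uncal, labels)

# Reliability diagram data: per bin, observed frequency of true
# triples vs. mean predicted probability. Perfect calibration lies
# on the identity line; the mean absolute gap is a rough summary.
for name, p in [("expit", probs_uncal),
                ("platt", probs_platt),
                ("isotonic", probs_iso)]:
    frac_pos, mean_pred = calibration_curve(labels, p, n_bins=10)
    print(f"{name:9s} mean |gap| = {np.abs(frac_pos - mean_pred).mean():.3f}")
```

On data like this, the expit-transformed scores exhibit a visibly larger gap from the identity line than either calibrated variant, mirroring the miscalibration seen in Figure 1.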