This article introduces the Variome Annotation Schema, which
aims to capture the core concepts and relations relevant to cataloguing and interpreting
human genetic variation and its relationship to disease, as described in the published
literature. The schema was inspired by the needs of the curators of the
International Society for Gastrointestinal Hereditary Tumours (InSiGHT) database, but is
intended to apply to genetic variation information across a range of diseases. The
schema has been applied to a small corpus of full-text journal publications on the subject
of inherited colorectal cancer. We show that inter-annotator agreement on this corpus
ranges from 0.78 to 0.95 F-score across different entity
types when exact matching is measured, and improves to a minimum F-score
of 0.87 when boundary matching is relaxed. Relations show more variability in agreement,
but several are reliable, with the highest, cohort-has-size, reaching
0.90 F-score. We also explore the relevance of the schema to the InSiGHT
database curation process. The schema and the corpus represent an important new resource
for the development of text mining solutions that address relationships among patient
cohorts, disease and genetic variation. We therefore also discuss the role that text
mining might play in curating information related to the human variome. The corpus
is available at http://opennicta.com/home/health/variome.
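
The difference between exact and relaxed boundary matching can be illustrated with a minimal sketch. It is not the evaluation code used for the corpus: the span representation, the entity type labels, the function name `agreement_f_score` and the rule that "relaxed" means any character overlap between two spans of the same type are all assumptions made here for illustration.

```python
from typing import List, Tuple

# Hypothetical span representation: (start offset, end offset exclusive, entity type).
Span = Tuple[int, int, str]


def _matches(a: Span, b: Span, relaxed: bool) -> bool:
    """Return True if two annotations agree under the chosen matching rule."""
    if a[2] != b[2]:
        return False  # entity types must be identical in both modes
    if relaxed:
        # Assumed relaxed rule: the two spans share at least one character.
        return a[0] < b[1] and b[0] < a[1]
    # Exact rule: both boundaries must coincide.
    return a[0] == b[0] and a[1] == b[1]


def agreement_f_score(ann_a: List[Span], ann_b: List[Span], relaxed: bool = False) -> float:
    """F-score between two annotators, treating annotator A as the reference."""
    if not ann_a or not ann_b:
        return 0.0
    # Recall: fraction of A's spans that have a matching span in B.
    recall = sum(any(_matches(a, b, relaxed) for b in ann_b) for a in ann_a) / len(ann_a)
    # Precision: fraction of B's spans that have a matching span in A.
    precision = sum(any(_matches(a, b, relaxed) for a in ann_a) for b in ann_b) / len(ann_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented toy annotations of the same text by two annotators.
annotator_a = [(0, 5, "gene"), (10, 25, "mutation"), (30, 42, "disease")]
annotator_b = [(0, 5, "gene"), (12, 25, "mutation"), (50, 58, "disease")]

exact = agreement_f_score(annotator_a, annotator_b)
relaxed = agreement_f_score(annotator_a, annotator_b, relaxed=True)
print(f"exact: {exact:.2f}, relaxed: {relaxed:.2f}")
```

In this toy case the two "mutation" spans differ only in their start offset, so they count as a disagreement under exact matching and as an agreement under relaxed matching, which is why the relaxed score is higher.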