Bone marrow donor registries around the world include more than 30 million volunteer donors from 57 different countries, and were responsible for over 17,000 hematopoietic stem cell transplants in 2016. The Brazilian Bone Marrow Volunteer Donor Registry (REDOME) was established in 1993 and is the third largest registry in the world, with more than 4.3 million donors. We characterized HLA allele and haplotype frequencies from REDOME, comparing them with the donors' self-reported race group classification. Five-locus haplotype frequencies (A~C~B~DRB1~DQB1) were estimated for each of the six race groups, resolving phase and allelic ambiguity using the expectation-maximization (EM) algorithm. The top 100 haplotypes in the race groups were separated into eight clusters, based on haplotype similarity, using CLUTO. We present HLA allele and haplotype frequency data for six race groups from 2,938,259 individuals from REDOME. The most frequent haplotype was the same for all groups: A*01:01g~C*07:01g~B*08:01g~DRB1*03:01g~DQB1*02:01g. Some frequent haplotypes, such as A*02:01g~C*16:01g~B*44:03~DRB1*07:01g~DQB1*02:01g, were not found in the Preta (Sub-Saharan African descent) group. A cluster including Branca (European) and Parda or non-informed (admixed) could be distinguished from both Preta (Sub-Saharan) and Indígena (Amerindian) groups, and from the Amarela (Asian) ones, which clustered with their original population. These results have implications for cross-population matching and can help in donor searches and population-based recruitment strategies.
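The EM step mentioned above can be illustrated with a minimal Python sketch. It is not the REDOME pipeline: it assumes unambiguous allele calls, Hardy-Weinberg proportions, and a toy two-locus data set, and the names `phasings` and `em_haplotype_frequencies` are hypothetical. It only shows how phase ambiguity is resolved by alternating between weighting the possible phasings of each genotype and re-estimating haplotype frequencies.

```python
from collections import defaultdict
from itertools import product


def phasings(genotype):
    """Enumerate the distinct unordered haplotype pairs consistent with an
    unphased genotype, given as a tuple of per-locus allele pairs."""
    first, rest = genotype[0], genotype[1:]
    pairs = set()
    for flips in product((False, True), repeat=len(rest)):
        h1, h2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.add(tuple(sorted(("~".join(h1), "~".join(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM (assumes Hardy-Weinberg proportions)."""
    candidates = [phasings(g) for g in genotypes]
    haplotypes = {h for pairs in candidates for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    total = 2 * len(genotypes)  # two haplotypes per individual
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: posterior weight of each phasing given current frequencies
            weights = [(1 if h1 == h2 else 2) * freq[h1] * freq[h2]
                       for h1, h2 in pairs]
            norm = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / norm
                counts[h2] += w / norm
        # M-step: expected haplotype counts become the new frequencies
        freq = {h: counts[h] / total for h in haplotypes}
    return freq


# Toy data (hypothetical): two homozygotes and one phase-ambiguous double heterozygote.
toy_genotypes = [
    (("A*01:01g", "A*01:01g"), ("B*08:01g", "B*08:01g")),
    (("A*01:01g", "A*02:01g"), ("B*08:01g", "B*44:03")),
    (("A*02:01g", "A*02:01g"), ("B*44:03", "B*44:03")),
]
toy_freq = em_haplotype_frequencies(toy_genotypes)
```

In this toy set the homozygous individuals pull the ambiguous individual toward the phasing that reuses their haplotypes, which is the behaviour that makes EM useful on large registries.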
Next-generation DNA sequencing was used to determine the HLA-A, -B, -C, -DRB1, and -DQB1 assignments of 1472 unrelated volunteers for the unrelated donor registry in Argentina. The analysis characterized all HLA exons and introns for class I alleles; at least exons 2 and 3 for HLA-DRB1; and exons 2 to 6 for HLA-DQB1. Of the distinct alleles present, there are 330 class I and 98 class II. The majority (~98%) of the cumulative allele frequency at each locus is contributed by alleles that appear at a frequency of at least 1 in 1000. Fourteen (18.2%) of the 77 novel class I and II alleles carry nonsynonymous variation within their exons; 52 (75.4%) class I novel alleles carry only single, apparently random, nucleotide variation within their introns/untranslated regions. Alleles encoding protein variation not usually detected by typing focused only on the exons encoding the antigen recognition domain are 1.0% of the class I assignments and 7.3% of the class II assignments (predominantly DQB1*02:02:01, DQB1*03:19:01, and DRB1*14:54:01). Updates to the common and well-documented list of alleles include 10 alleles previously thought to be uncommon but that are found at least 30 times. A total of 187 five-locus haplotypes were estimated by the expectation-maximization algorithm to be present 3 or more times. While the known HLA diversity continues to increase, the conservation of known allele sequences is remarkable. Overall, the HLA diversity observed in the Argentinian population reflects its European and Native American ancestry.
Motivation
For over 10 years, allele-level HLA matching for bone marrow registries has been performed in a probabilistic context. HLA typing technologies provide ambiguous results in that they cannot distinguish among all known HLA allele sequences; registries have therefore implemented matching algorithms that provide lists of donors and cord blood units ordered by the likelihood of allele-level matching at specific HLA loci. With the growth of registry sizes, current match algorithm implementations are unable to provide match results in real time.
Results
We present here a novel, computationally efficient, open-source implementation of an HLA imputation and match algorithm using a graph database platform. Because matching is done by graph traversal, the algorithm's runtime is largely unaffected by registry size. This implementation generates results that agree with consensus output on a publicly available match algorithm cross-validation dataset.
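As a rough illustration of why traversal can decouple lookup cost from registry size, the following Python sketch indexes donors by genotype node in a plain in-memory dictionary. It is not the GRIMM code and does not use Neo4j; the class `MatchGraph`, its methods, and the genotype keys are hypothetical, and the score assumes that patient and donor imputations are independent, so the probability of an identical genotype is the sum over shared genotypes of the product of the two probabilities.

```python
from collections import defaultdict


class MatchGraph:
    """Toy index: genotype node -> list of (donor_id, imputed probability) edges."""

    def __init__(self):
        self.donors_by_genotype = defaultdict(list)

    def add_donor(self, donor_id, imputed):
        """imputed: dict mapping a genotype key to its imputed probability."""
        for genotype, prob in imputed.items():
            self.donors_by_genotype[genotype].append((donor_id, prob))

    def match(self, patient_imputed):
        """Follow edges from the patient's candidate genotypes to donors.

        Only donors attached to those genotype nodes are visited, so the work
        depends on the number of candidate genotypes and their edges rather
        than on the total number of donors in the registry.
        """
        scores = defaultdict(float)
        for genotype, p_patient in patient_imputed.items():
            for donor_id, p_donor in self.donors_by_genotype.get(genotype, ()):
                scores[donor_id] += p_patient * p_donor
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: genotype keys stand in for full multi-locus genotypes.
graph = MatchGraph()
graph.add_donor("D1", {"g1": 0.9, "g2": 0.1})
graph.add_donor("D2", {"g3": 1.0})
ranked = graph.match({"g1": 0.5, "g2": 0.5})
```

Here donor D2 is never visited, because none of the patient's candidate genotypes has an edge to it.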
Availability and implementation
The Python, Perl and Neo4j code is available at https://github.com/nmdp-bioinformatics/grimm.
Supplementary information
Supplementary data are available at Bioinformatics online.