The gene content of plants varies between individuals of the same species due to gene presence/absence variation, and selection can alter the frequency of specific genes in a population. Selection during domestication and breeding will modify the genomic landscape, though the nature of these modifications is only understood for specific genes or on a more general level (e.g., by a loss of genetic diversity). Here we have assembled and analyzed a soybean (Glycine spp.) pangenome representing more than 1,000 soybean accessions derived from the USDA Soybean Germplasm Collection, including both wild and cultivated lineages, to assess genomewide changes in gene and allele frequency during domestication and breeding. We identified 3,765 genes that are absent from the Lee reference genome assembly and assessed the presence/absence of all genes across this population. In addition to a loss of genetic diversity, we found a significant reduction in the average number of protein-coding genes per individual during domestication and subsequent breeding, though with some genes and allelic variants increasing in frequency associated with selection for agronomic traits. This analysis provides a genomic perspective of domestication and breeding in this important oilseed crop.
Key messageSafeguarding crop yields in a changing climate requires bioinformatics advances in harnessing data from vast phenomics and genomics datasets to translate research findings into climate smart crops in the field.
Key message
The major soy protein QTL, cqProt-003, was analysed for haplotype diversity and global distribution, and results indicate 304 bp deletion and variable tandem repeats in protein coding regions are likely causal candidates.
Abstract
Here, we present association and linkage analysis of 985 wild, landrace and cultivar soybean accessions in a pan genomic dataset to characterize the major high-protein/low-oil associated locus cqProt-003 located on chromosome 20. A significant trait-associated region within a 173 kb linkage block was identified, and variants in the region were characterized, identifying 34 high confidence SNPs, 4 insertions, 1 deletion and a larger 304 bp structural variant in the high-protein haplotype. Trinucleotide tandem repeats of variable length present in the second exon of gene Glyma.20G085100 are strongly correlated with the high-protein phenotype and likely represent causal variation. Structural variation has previously been found in the same gene, for which we report the global distribution of the 304 bp deletion and have identified additional nested variation present in high-protein individuals. Mapping variation at the cqProt-003 locus across demographic groups suggests that the high-protein haplotype is common in wild accessions (94.7%), rare in landraces (10.6%) and near absent in cultivated breeding pools (4.1%), suggesting its decrease in frequency primarily correlates with domestication and continued during subsequent improvement. However, the variation that has persisted in under-utilized wild and landrace populations holds high breeding potential for breeders willing to forego seed oil to maximize protein content. The results of this study include the identification of distinct haplotype structures within the high-protein population, and a broad characterization of the genomic context and linkage patterns of cqProt-003 across global populations, supporting future functional characterization and modification.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.