2023
DOI: 10.1007/s00122-023-04254-9

Sample size determination for training set optimization in genomic prediction

Key message: A practical approach is developed to determine a cost-effective optimal training set for selective phenotyping in a genomic prediction study. An R function is provided to facilitate the application of the approach.

Abstract: Genomic prediction (GP) is a statistical method used to select quantitative traits in animal or plant breeding. For this purpose, a statistical prediction model is first built that uses phenotypic and genotypic data …

Cited by 8 publications (4 citation statements)
References: 30 publications
“…Here, the NDCG@k (g, g) and mean_NDCG@k (g, g) represent the relative prediction ability of the optimal training set to the entire candidate set. To avoid confusion, they were, respectively, renamed as RE_NDCG@k and RE_mean_NDCG@k. The relative NDCG values will attain the maximum of 1, if the entire candidate set is used as the training set (Wu et al 2023).…”
Section: Analysis Of Real Traits (mentioning)
confidence: 99%
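
The relative NDCG described in the statement above can be illustrated with a minimal Python sketch. It assumes the common definition of NDCG@k with a linear gain and a 1/log2(rank + 1) discount, and it assumes that the relative value is the ratio of the NDCG obtained from a training subset to the NDCG obtained from the entire candidate set. Neither definition is spelled out in the excerpt, and all function names are hypothetical.

```python
import math

def dcg_at_k(true_values, scores, k):
    """DCG@k: rank individuals by `scores` and sum the discounted true values of the top k."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Linear gain (the true value itself) and log2 discount are assumptions.
    return sum(true_values[i] / math.log2(rank + 2) for rank, i in enumerate(order))

def ndcg_at_k(true_values, predicted, k):
    """NDCG@k: DCG of the predicted ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(true_values, true_values, k)
    return dcg_at_k(true_values, predicted, k) / ideal

def relative_ndcg_at_k(true_values, pred_subset, pred_full, k):
    """Assumed relative measure: NDCG from a training subset over NDCG from the full candidate set."""
    return ndcg_at_k(true_values, pred_subset, k) / ndcg_at_k(true_values, pred_full, k)

# Toy values, invented for illustration only (positive true values assumed).
true_values = [3.0, 2.0, 1.0, 0.5]
pred_full = [2.8, 2.1, 0.9, 0.6]    # predictions from a model trained on all candidates
pred_subset = [2.0, 2.5, 1.0, 0.4]  # predictions from a model trained on a subset
print(relative_ndcg_at_k(true_values, pred_subset, pred_full, k=2))
```

Under this assumed definition the ratio equals 1 whenever the subset predictions coincide with the full-set predictions, which matches the statement that the maximum of 1 is attained when the entire candidate set is used for training.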
“…A practical training set size can then be accurately interpolated at a fixed acceptable NDCG or mean_NDCG value through the utility function. Most recently, Wu et al (2023) and Fernández-González et al (2023), respectively, applied logistic and non-logistic growth curves to determine the sample size for the training set optimization. We will investigate this issue for the optimal training set derived from the proposed Bayesian optimization approach in a future study.…”
Section: Dataset (mentioning)
confidence: 99%
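
As a rough illustration of interpolating a training set size from a growth curve, the sketch below inverts a three-parameter logistic curve at a target accuracy. The curve form y(n) = a / (1 + b·exp(−c·n)) and the parameter values are assumptions chosen for illustration; the excerpt does not give the exact model or fitted values, and the curve-fitting step itself is omitted.

```python
import math

def logistic_curve(n, a, b, c):
    """Assumed three-parameter logistic: accuracy as a function of training set size n."""
    return a / (1.0 + b * math.exp(-c * n))

def size_for_target(target, a, b, c):
    """Solve logistic_curve(n) = target for n (closed-form inverse)."""
    if not 0.0 < target < a:
        raise ValueError("target must lie strictly between 0 and the plateau a")
    return math.log(b * target / (a - target)) / c

# Hypothetical fitted parameters: plateau a, shape b, growth rate c.
a, b, c = 0.80, 4.0, 0.02
target = 0.95 * a  # e.g. accept 95% of the plateau accuracy
n_needed = size_for_target(target, a, b, c)
print(math.ceil(n_needed))
```

Rounding up gives the smallest whole number of individuals whose fitted accuracy reaches the target under these assumed parameters.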
“…Moreover, introduced sampling bias when TS does not adequately represent the genetic diversity present in the entire population, leads to biased estimates of GEBVs [50]. Different studies examined a wide range of TS optimization methods including the mean of the coefficient of determination (CDmean), the mean of the prediction error variance (PEVmean), stratified sampling, partitioning around medoids (PAM), Rscore, generalized average genomic relationship (gAvg_GRM), and random sampling [51][52][53]. In general, it was concluded that a training set size of around 50-55% of all available genotypes usually generates accuracies in the range of 95-100% of the maximum for targeted optimization that use the information from the test set to build TS, while for untargeted optimization that does not use genomic information from a test set to determine the training set, a TS size of 65-85% is required for similar results [52].…”
Section: Discussion (mentioning)
confidence: 99%
“…The expected accuracy of genomic prediction is determined by the training population size, the heritability (h²) of the traits, and effective number of chromosome segments underlying the traits (Daetwyler et al, 2010). However, many empirical and simulation studies in crops revealed that the prediction accuracy of the single-trait rrBLUP method for different heritabilities of traits reaches a plateau when the training set exceeds 250 individuals and the number of markers is >1000 across the genome (Asoro et al, 2011; Combs and Bernardo, 2013; Ou and Liao, 2019; Wu et al, 2023). In this study, the lines in the training set were homozygous and homogeneous DHIs from eight parents crossed in a half-diallel design, with 6636 genome-wide SNP data, and the different GP models were tested with more than 283 genotypes in the training sets by five-fold cross-validation.…”
Section: Reliability Of Phenotypes For Gp (mentioning)
confidence: 99%
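
The dependence of expected accuracy on training population size, heritability, and the effective number of chromosome segments can be sketched with the deterministic formula usually attributed to Daetwyler et al., r = sqrt(N·h² / (N·h² + Mₑ)). The function name and the input values below are illustrative assumptions, not figures from any of the cited studies.

```python
import math

def expected_accuracy(n_train, h2, m_e):
    """Expected accuracy r = sqrt(N*h2 / (N*h2 + Me)).

    n_train: training population size N
    h2:      trait heritability
    m_e:     effective number of independent chromosome segments
    """
    signal = n_train * h2
    return math.sqrt(signal / (signal + m_e))

# Illustrative inputs only: the gain per added individual shrinks as N grows.
for n in (100, 250, 500, 1000):
    print(n, round(expected_accuracy(n, h2=0.5, m_e=500), 3))
```

The diminishing gain with larger N is consistent with the plateau behaviour mentioned in the statement above, although the size at which a plateau appears depends on h² and Mₑ.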