Marker pre-selection as a strategy to enhance genomic prediction with machine learning: Exploring the influence of trait-specific genomic structures

Keywords: bagging; multilayer perceptron; artificial neural networks; genome-wide selection.

Abstract

This study incorporated dimensionality reduction based on marker significance to better harness the potential of machine learning for genomic prediction under different trait-genomic structures. The aim was to show that models fitted to reduced data would improve predictive accuracy and precision (root-mean-square error: RMSE) while reducing computational time. Distinct subsets of markers, in simulated data, were chosen by ranking marker importance with the Bagging technique. Predictive modelling was then conducted using both Bagging and several architectures of a Multilayer Perceptron (MLP) neural network. The study used six traits of a simulated F2 population (derived from contrasting homozygotes) with 1,000 individuals. Three traits had different heritabilities (0.4, 0.6, and 0.8) and were each controlled by a set of 40 quantitative trait loci (QTLs). In three other traits, four additional QTLs with more pronounced heritability effects (set at unity) were introduced while preserving the same genetic control structure as the first three traits. Training time increased gradually with the number of markers for both techniques; for Bagging, however, computation time grew notably beyond 100 markers. Compared with the MLP model, the Bagging model generally achieved better accuracy and precision (lower RMSE) regardless of heritability and added QTLs. Most importantly, for traits under the strong genetic control of the additional QTLs, the prediction performance of MLP networks declined from a few markers (~10) onward, whereas that of Bagging remained constant or improved slightly. Overall, the dimensionality reduction procedure effectively improves genomic prediction, and Bagging captures complex genetic control structures for prediction better than MLP networks.
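
The workflow described in the abstract (rank markers by Bagging importance, keep the top subset, then predict with Bagging and an MLP) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not the authors' implementation: the use of scikit-learn, the independent (unlinked) markers, the equal additive QTL effects, the tree count, the single hidden layer of 8 neurons, the subset sizes, and the use of the squared correlation between observed and predicted values as the accuracy measure.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical toy F2 population: genotypes coded 0/1/2 with the expected
# 1:2:1 segregation. Markers are drawn independently (no linkage), which is
# a simplification of a real genome simulation.
n_ind, n_markers, n_qtl, h2 = 1000, 200, 40, 0.6
X = rng.choice([0, 1, 2], size=(n_ind, n_markers), p=[0.25, 0.5, 0.25]).astype(float)

# Assumed architecture: n_qtl randomly placed loci with equal additive effects.
qtl = rng.choice(n_markers, size=n_qtl, replace=False)
g = X[:, qtl].sum(axis=1)
# Residual variance chosen so that var(g) / var(y) is approximately h2.
e = rng.normal(0.0, np.sqrt(g.var() * (1 - h2) / h2), size=n_ind)
y = g + e
y = (y - y.mean()) / y.std()  # standardize the phenotype

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)


def bagging_importance(X, y, n_trees=30):
    """Average impurity-based importance over the bagged regression trees."""
    bag = BaggingRegressor(n_estimators=n_trees, random_state=0).fit(X, y)
    return np.mean([tree.feature_importances_ for tree in bag.estimators_], axis=0)


def evaluate(model, cols):
    """Fit on the training split with the given markers; return (R2, RMSE) on the test split."""
    pred = model.fit(X_tr[:, cols], y_tr).predict(X_te[:, cols])
    r2 = float(np.corrcoef(y_te, pred)[0, 1] ** 2)
    rmse = float(np.sqrt(np.mean((y_te - pred) ** 2)))
    return r2, rmse


# Importance is estimated on the training split only, to avoid leaking
# test information into the marker pre-selection step.
importance = bagging_importance(X_tr, y_tr)
ranked = np.argsort(importance)[::-1]  # most important marker first

results = {}
for k in (10, 40, 100):
    top = ranked[:k]
    results[k] = {
        "bagging": evaluate(BaggingRegressor(n_estimators=30, random_state=0), top),
        "mlp": evaluate(
            MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=0), top
        ),
    }
    print(k, results[k])
```

Ranking the markers on the training split alone keeps the held-out individuals out of the selection step, so the reported values are not inflated by selection bias. The numbers produced by this toy setup are not expected to reproduce the results reported in the study.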

Published
2025-09-02
How to Cite
Barbosa, W. F., Silva Júnior, A. C. da, Sousa, I. C. de, Moraes, F. E. de O. C. de, Siqueira, M. J. S., Bhering, L. L., Nascimento, M., & Cruz, C. D. (2025). Marker pre-selection as a strategy to enhance genomic prediction with machine learning: Exploring the influence of trait-specific genomic structures. Acta Scientiarum. Agronomy, 47(1), e72552. https://doi.org/10.4025/actasciagron.v47i1.72552