Two-step genomic prediction using artificial neural networks - an effective strategy for reducing computational costs and increasing prediction accuracy

Maurício de Oliveira Celeri; Cynthia Aparecida Valiati Barreto; Wagner Faria Barbosa; Leísa Pires Lima; Lucas Souza da Silveira; Ana Carolina Campana Nascimento; Moyses Nascimento; Camila Ferreira Azevedo

doi:10.4025/actasciagron.v47i1.69089

Maurício de Oliveira Celeri Universidade Federal de Viçosa https://orcid.org/0000-0002-1032-3589
Cynthia Aparecida Valiati Barreto Universidade Federal de Viçosa http://orcid.org/0000-0003-0474-4587
Wagner Faria Barbosa Universidade Federal de Viçosa https://orcid.org/0000-0001-8725-2099
Leísa Pires Lima Instituto Federal de Educação, Ciência e Tecnologia do Sudeste de Minas Gerais http://orcid.org/0000-0003-2205-4192
Lucas Souza da Silveira Universidade Federal de Viçosa https://orcid.org/0000-0003-4356-751X
Ana Carolina Campana Nascimento Universidade Federal de Viçosa http://orcid.org/0000-0002-6985-1490
Moyses Nascimento Universidade Federal de Viçosa http://orcid.org/0000-0001-5886-9540
Camila Ferreira Azevedo Universidade Federal de Viçosa http://orcid.org/0000-0003-0438-5123

Keywords: multivariate adaptive regression splines; boosting; artificial neural network; genetic breeding.

Abstract

Artificial neural networks (ANNs) are powerful nonparametric tools for estimating genomic breeding values (GEBVs) in genetic breeding. A significant advantage of ANNs is that they make predictions without prior assumptions about the data distribution or the relationship between genotype and phenotype. However, ANNs carry a high computational cost, and their predictions may be underestimated when all molecular markers are included. This study proposes a two-step genomic prediction procedure using ANNs to address these challenges. In the first step, molecular markers were selected either directly through Multivariate Adaptive Regression Splines (MARS) or indirectly through their importance as identified by Boosting, retaining the top 5, 20, and 50% most important markers. In the second step, the selected markers were used for genomic prediction with ANNs. The approach was applied to two simulated traits: one with ten trait-controlling loci and a heritability of 0.4 (Scenario SC1) and the other with 100 trait-controlling loci and a heritability of 0.2 (Scenario SC2). ANN predictions with marker selection were compared with those obtained without any marker selection. Reducing the number of markers proved to be an efficient strategy, resulting in improved accuracy, reduced mean squared error (MSE), and shorter model-fitting times. The best ANN predictions were obtained with ten markers selected by MARS in SC1 and with the top 5% most relevant markers selected by Boosting in SC2. In SC1, predictions using MARS achieved an increase of more than 31% in accuracy and a 90% reduction in MSE. In SC2, predictions using Boosting achieved an increase of more than 15% in accuracy and an 83% reduction in MSE. In both scenarios, computational time was up to ten times shorter with marker selection. Overall, the two-step prediction procedure emerged as an effective strategy for enhancing the computational and predictive performance of ANN models.
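
The marker-reduction step based on importance can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy importance scores, and the toy genotype matrix are hypothetical, and the scores are simply assumed to have been produced beforehand by a Boosting model. The sketch only shows how a top fraction of markers (for example 20%) could be retained before the reduced matrix is passed to an ANN.

```python
def select_top_markers(importances, fraction):
    """Return the column indices of the top `fraction` of markers by importance.

    `importances` is assumed to hold one score per marker, e.g. taken from a
    previously fitted boosting model (hypothetical input, not computed here).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Assumption: round to the nearest whole marker and keep at least one.
    n_keep = max(1, round(fraction * len(importances)))
    # Rank marker indices from most to least important.
    ranked = sorted(range(len(importances)),
                    key=lambda j: importances[j], reverse=True)
    # Return the kept indices in their original column order.
    return sorted(ranked[:n_keep])


def subset_markers(genotypes, kept):
    """Keep only the selected marker columns of a genotype matrix (list of rows)."""
    return [[row[j] for j in kept] for row in genotypes]


# Toy data: 10 markers, 2 individuals, genotypes coded 0/1/2 (all hypothetical).
importances = [0.02, 0.40, 0.00, 0.15, 0.01, 0.30, 0.00, 0.05, 0.03, 0.04]
genotypes = [
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    [2, 0, 1, 1, 0, 2, 1, 0, 0, 1],
]

kept = select_top_markers(importances, 0.20)   # top 20% of 10 markers
reduced = subset_markers(genotypes, kept)      # matrix to feed the second step
print(kept, reduced)
```

In this sketch the second step (fitting the ANN on `reduced`) is left out on purpose; any network library could consume the reduced matrix in place of the full one.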
