Two-step genomic prediction using artificial neural networks - an effective strategy for reducing computational costs and increasing prediction accuracy
Abstract
Artificial neural networks (ANNs) are powerful nonparametric tools for estimating genomic breeding values (GEBVs) in genetic breeding. One significant advantage of ANNs is their ability to make predictions without requiring prior assumptions about the data distribution or the relationship between genotype and phenotype. However, ANNs come with a high computational cost, and their predictions may be underestimated when all molecular markers are included. This study proposes a two-step genomic prediction procedure using ANNs to address these challenges. In the first step, molecular markers were selected either directly through Multivariate Adaptive Regression Splines (MARS) or indirectly through their importance as identified by Boosting, retaining the top 5, 20, or 50% of markers ranked by importance. In the second step, the selected markers were used for genomic prediction with ANNs. This approach was applied to two simulated traits: one with ten trait-controlling loci and a heritability of 0.4 (Scenario SC1) and the other with 100 trait-controlling loci and a heritability of 0.2 (Scenario SC2). ANN predictions with marker selection were compared with ANN predictions without any marker selection. Reducing the number of markers proved to be an efficient strategy, resulting in improved accuracy, reduced mean squared error (MSE), and shorter model-fitting times. The best ANN predictions were obtained with ten markers selected by MARS in SC1 and with the top 5% most relevant markers selected by Boosting in SC2. In SC1, predictions using MARS achieved an increase of more than 31% in accuracy and a 90% reduction in MSE. In SC2, predictions using Boosting achieved an increase of more than 15% in accuracy and an 83% reduction in MSE. In both scenarios, computational time was up to ten times shorter with marker selection. Overall, the two-step prediction procedure emerged as an effective strategy for enhancing the computational and predictive performance of ANN models.
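The marker-reduction step and the two evaluation metrics described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The importance scores and genotypes are hypothetical random values standing in for Boosting output and simulated data. The function names are illustrative. Accuracy is assumed here to be the Pearson correlation between predicted and true values, a common convention that the abstract does not spell out. The ANN fitting itself is omitted; the reduced matrix is what would be passed to it.

```python
import math
import random


def select_top_markers(importance, fraction):
    """Step 1: indices of the top `fraction` of markers by importance score."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, round(fraction * len(importance)))
    ranked = sorted(range(len(importance)),
                    key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:n_keep])


def subset_markers(genotypes, keep):
    """Reduce a genotype matrix (list of rows) to the selected marker columns."""
    return [[row[j] for j in keep] for row in genotypes]


def pearson(x, y):
    """Pearson correlation, assumed here as the measure of accuracy."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mse(pred, true):
    """Mean squared error between predicted and true values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


# Hypothetical toy data: 6 individuals, 20 markers coded 0/1/2.
random.seed(1)
genotypes = [[random.randint(0, 2) for _ in range(20)] for _ in range(6)]
# Hypothetical importance scores, standing in for Boosting output.
importance = [random.random() for _ in range(20)]

for fraction in (0.05, 0.20, 0.50):
    keep = select_top_markers(importance, fraction)
    reduced = subset_markers(genotypes, keep)  # input for the ANN in step 2
    print(f"top {fraction:.0%}: {len(keep)} markers, "
          f"matrix {len(reduced)} x {len(reduced[0])}")

# Hypothetical predictions, only to show how the metrics are computed.
true_values = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(pearson(predicted, true_values), mse(predicted, true_values))
```

Returning the selected indices in their original order keeps the reduced matrix aligned with the marker map, so selected markers can still be traced back to their genomic positions.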
DECLARATION OF ORIGINALITY AND COPYRIGHT
I declare that this article is original and has not been submitted for publication, in whole or in part, to any other national or international journal.
Copyright belongs exclusively to the authors. The journal licenses content under the Creative Commons Attribution 4.0 (CC BY 4.0) license, which permits sharing (copying and distributing the material in any medium or format) and adaptation (remixing, transforming, and building upon the licensed material for any purpose, including commercial purposes).