Data variability in the imputation quality of missing data
Abstract
Imputation methods were developed to estimate missing values and thereby avoid the problems caused by the loss of this information. This study assesses whether data variability influences the results obtained after applying an imputation method. Incomplete databases were generated by removing different amounts of data from complete real databases of tomato experiments conducted in a randomized block design with three replications and 12 treatments. The variables evaluated were fruit weight per plant, number of fruits per plant, and average fruit length and width, forming eight balanced databases. The distribution-free multiple imputation method was then applied to produce completed databases. The amount of missing data influenced the accuracy measures in this study. Imputation was inadequate when variability was high and more precise and accurate when variability was low. These results confirm the importance of assessing data variability before choosing to apply an imputation method.
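The workflow described in the abstract (remove cells from a complete trial, impute them, and compare the error under low and high variability) can be illustrated with a minimal Python sketch. It is not the authors' code. The data are synthetic, the treatment-mean fill is a deliberately simple stand-in for the distribution-free multiple imputation method, and every function name and parameter value is hypothetical.

```python
import numpy as np


def coefficient_of_variation(values):
    """Coefficient of variation in percent (sample SD divided by the mean)."""
    values = np.asarray(values, dtype=float)
    return 100.0 * np.std(values, ddof=1) / np.mean(values)


def simulate_trial(cv_percent, rng, n_treatments=12, n_reps=3, mean=100.0):
    """Synthetic treatments x replications matrix with a target CV (assumption:
    normal noise around a common mean, no treatment or block effects)."""
    sd = mean * cv_percent / 100.0
    return rng.normal(loc=mean, scale=sd, size=(n_treatments, n_reps))


def remove_at_random(data, n_missing, rng):
    """Return a copy of `data` with `n_missing` randomly chosen cells set to NaN."""
    incomplete = np.array(data, dtype=float)
    idx = rng.choice(incomplete.size, size=n_missing, replace=False)
    incomplete.flat[idx] = np.nan
    return incomplete


def impute_treatment_mean(incomplete):
    """Stand-in imputation: fill each gap with the mean of the observed
    replications of the same treatment (row). Rows with no observed value
    fall back to the overall observed mean."""
    incomplete = np.asarray(incomplete, dtype=float)
    mask = np.isnan(incomplete)
    counts = (~mask).sum(axis=1)
    sums = np.where(mask, 0.0, incomplete).sum(axis=1)
    overall_mean = sums.sum() / counts.sum()
    row_means = np.where(counts > 0, sums / np.maximum(counts, 1), overall_mean)
    filled = incomplete.copy()
    filled[mask] = np.broadcast_to(row_means[:, None], incomplete.shape)[mask]
    return filled


def rmse_on_missing(complete, filled, mask):
    """Root mean squared error computed only over the cells that were removed."""
    diff = np.asarray(filled)[mask] - np.asarray(complete)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)
for target_cv in (5.0, 40.0):          # low versus high variability
    complete = simulate_trial(target_cv, rng)
    for n_missing in (2, 6):           # different amounts of removed data
        incomplete = remove_at_random(complete, n_missing, rng)
        mask = np.isnan(incomplete)
        filled = impute_treatment_mean(incomplete)
        error = rmse_on_missing(complete, filled, mask)
        print(f"CV={coefficient_of_variation(complete.ravel()):.1f}% "
              f"missing={n_missing} RMSE={error:.2f}")
```

Because the error is measured only on the removed cells, the printed values show how far the imputed entries fall from the true ones. Repeating the loop over many random removals would give a steadier picture than a single draw.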
DECLARATION OF ORIGINALITY AND COPYRIGHT
I declare that this article is original and has not been submitted, in whole or in part, for publication in any other national or international journal.
Copyright belongs exclusively to the authors. The journal uses the Creative Commons Attribution 4.0 (CC BY 4.0) license, which permits sharing (copying and distributing the material in any medium or format) and adaptation (remixing, transforming, and building upon the licensed material for any purpose, including commercial purposes).