Data variability in the imputation quality of missing data

Keywords: missing data; data imputation; randomized block design; distribution-free multiple imputation.

Abstract

Imputation methods were developed to estimate missing values and thereby avoid the problems caused by the loss of this information. This study assesses whether data variability influences the results obtained after applying an imputation method. Incomplete databases were generated by removing different amounts of data from complete, real databases of tomato experiments conducted in a randomized block design with 12 treatments and three replications. The evaluated variables were fruit weight per plant, number of fruits per plant, and average fruit length and width, forming eight balanced databases. The distribution-free multiple imputation method was then applied to each incomplete database to produce completed databases. The amount of missing information influenced the accuracy measures for the data in this study. Imputation was inadequate when variability was high, but more precise and accurate when variability was low. These results confirm the importance of assessing data variability before choosing to apply an imputation method.
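
The workflow summarized in the abstract (a 12-treatment, three-block table, cells removed at random, imputation, and an accuracy check) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data are simulated rather than taken from the tomato experiments, all function names are hypothetical, and the additive "treatment mean + block mean - grand mean" fill is a simple stand-in, not the distribution-free multiple imputation method applied in the study.

```python
import random
import statistics


def simulate_rbd(n_treat=12, n_blocks=3, mean=100.0, cv=0.1, seed=0):
    """Simulate a treatments x blocks table whose residual SD is cv * mean."""
    rng = random.Random(seed)
    treat_eff = [rng.uniform(-10, 10) for _ in range(n_treat)]
    block_eff = [rng.uniform(-5, 5) for _ in range(n_blocks)]
    return [[mean + t + b + rng.gauss(0, cv * mean) for b in block_eff]
            for t in treat_eff]


def remove_cells(table, n_missing, seed=0):
    """Return a copy of the table with n_missing random cells set to None."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(table)) for j in range(len(table[0]))]
    missing = rng.sample(cells, n_missing)
    holed = [row[:] for row in table]
    for i, j in missing:
        holed[i][j] = None
    return holed, missing


def impute_additive(holed):
    """Fill each gap with row mean + column mean - grand mean (observed values only).

    This is a simple stand-in, NOT the distribution-free multiple imputation method.
    """
    observed = [v for row in holed for v in row if v is not None]
    grand = statistics.mean(observed)
    filled = [row[:] for row in holed]
    for i, row in enumerate(holed):
        for j, value in enumerate(row):
            if value is None:
                row_obs = [x for x in row if x is not None]
                col_obs = [r[j] for r in holed if r[j] is not None]
                row_mean = statistics.mean(row_obs) if row_obs else grand
                col_mean = statistics.mean(col_obs) if col_obs else grand
                filled[i][j] = row_mean + col_mean - grand
    return filled


def rmse(complete, filled, missing):
    """Root mean squared error over the cells that were removed."""
    squared = [(complete[i][j] - filled[i][j]) ** 2 for i, j in missing]
    return (sum(squared) / len(squared)) ** 0.5


def imputation_error(cv, n_missing=9, seed=1):
    """Simulate, remove cells, impute, and return the RMSE for one CV level."""
    table = simulate_rbd(cv=cv, seed=seed)
    holed, missing = remove_cells(table, n_missing, seed=seed)
    return rmse(table, impute_additive(holed), missing)


if __name__ == "__main__":
    for cv in (0.05, 0.5):
        print(f"CV = {cv:.2f}: RMSE of imputed cells = {imputation_error(cv):.2f}")
```

Running the script prints the imputation error for a low-variability and a high-variability table built from the same random draws, which gives a rough feel for why the coefficient of variation is worth checking before imputing.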

Published
2024-04-03
How to Cite
Stochero, E. L. M., Dal’Col Lúcio, A., & Jacobi, L. F. (2024). Data variability in the imputation quality of missing data. Acta Scientiarum. Agronomy, 46(1), e66185. https://doi.org/10.4025/actasciagron.v46i1.66185
Section
Biometry, Modeling, and Statistics

 
