Assessment of the use of statistical methods in articles published in a journal of veterinary science from 2000 to 2010

Statistics is a key tool for validating the conclusions of scientific papers. However, errors in its use, including low-power tests and analyses inadequate for the study, are still frequent. Through a census of 307 articles published between 2000 and 2010 in the journal Archives of Veterinary Science, this research found that 34% of the papers drew conclusions without statistical support and 34% relied on statistical methods inadequate for the data evaluated, whereas only 32% presented conclusions based on statistical methods consistent with the structure of the analyzed data. Furthermore, the percentage of inadequate conclusions may be as high as 47%, since some of the articles that did not use statistical analysis should have applied some method to validate their conclusions. These results warn against the misuse of statistical methods, which compromises the quality and reliability of the conclusions presented in most papers.


Introduction
Statistics is a tool that validates the conclusions of scientific studies. Nevertheless, many conclusions are questionable, if not erroneous, when this tool is not properly employed (CONCEIÇÃO, 2008). The adequate application of statistical methods should cover a complete description of the methodology used, an appropriate use of statistical methodology, and interpretations and conclusions limited to the technique used (WHITE, 1979).
In general, there is no certainty about the validity of scientific conclusions. The function of using statistical methods is to determine the margin of error associated with conclusions, based on the knowledge of the variability observed in the results. In this way, an improper application may undermine the validity of the study, leading to untrue conclusions (CALLEGARI-JACQUES, 2003).
In any scientific study, the statistical tests employed should be adequate and sufficiently described, so that their application can be critically evaluated. The absence or improper use of statistical methods hampers scientific development (SCHWARTZ et al., 1997).
According to Glickman (2010), errors occur because researchers only care about statistics after they have collected the data, or only after the manuscript has been returned for correction of the statistical analysis. To avoid this, Parker (2000) suggests that a statistician take part in the research team from the experimental design stage onward.
The present study aimed to describe the characteristics of the use of statistical methods in papers published in a national journal of veterinary science (Archives of Veterinary Science), from 2000 to 2010, in order to identify and report possible failures in the application of these methods.

Material and methods
The data used in this study came from a census of the full articles published in the journal Archives of Veterinary Science between 2000 and 2010 (volumes 5-15). Papers published in special editions or appendices (e.g., symposium proceedings) were not considered. The articles were classified by year of publication, study area, type of article, and source of the data presented.
Every article that used numerical data in any way to express its results was considered an application of mathematical methods. The application of statistical methods, in turn, was considered to occur when the authors used a statistical tool to describe, evaluate, and/or understand the numerical data resulting from their studies.
For each article, complete information was recorded and itemized on the use of mathematical and/or statistical methods, statistical software, form of presentation of numerical results, data transformation, sampling plan, experimental design, confidence level of the statistical tests, and the statistical methods applied. These methods were divided into: descriptive statistics, frequency study, analysis of variance, means comparison test, non-parametric tests, regression, correlation, multivariate statistics, meta-analysis (which combines the results of independent studies into a single summary measure), and modeling (development and application of prediction equations, calibration, validation, study of residuals, and others).
Based on the precepts for the use of each of these statistical methods, as presented in the literature (CALLEGARI-JACQUES, 2003; FERREIRA, 2005; GOMES, 2009; STORCK, 2000), the methodology employed by the authors was classified as adequate or inadequate. The main statistical method on which each article based its conclusions was also recorded. The articles were evaluated on the basis of the information about statistical methods that they described and presented.
The data generated in this analysis were compiled into a database using Microsoft Access 2010, and the statistical analyses were run in the software package Statistica v.8. The data were then subjected to descriptive analysis and compared statistically using the chi-square test (Fisher's exact test), at a 95% confidence level.
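As a minimal sketch of the kind of comparison described above, the following Python code computes the Pearson chi-square statistic and a two-sided Fisher's exact p-value for a 2x2 table of counts. The function names and the example counts are hypothetical illustrations, not data from this study, and the analyses reported here were run in Statistica, not with this code.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 df, no continuity
    correction) for a 2x2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value: sum of the hypergeometric
    probabilities of all tables no more likely than the observed one."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_observed = prob(a)
    lowest, highest = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical counts (NOT from this study): rows are two periods,
# columns are articles without / with statistical analysis.
example = [[12, 8], [5, 15]]
stat, p_chi = chi_square_2x2(example)
p_fisher = fisher_exact_2x2(example)
print(f"chi-square = {stat:.3f} (p = {p_chi:.4f}); Fisher p = {p_fisher:.4f}")
```

Fisher's exact test is usually preferred over the chi-square approximation when expected cell counts are small.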

Results and discussion
Between 2000 and 2010, 307 articles were published in the journal Archives of Veterinary Science. The results are summarized in Table 1. The largest share of the articles (36.5%) was published in the area of animal production, followed by clinic/surgery (27.0%) and animal reproduction (17.6%) (Table 1). The remaining studies fell into the areas of epidemiology, anatomy/histology, physiology/biochemistry, animal welfare, public health, and economy/rural administration, totaling 18.9% of the articles (Table 1).
Of the total, 52.8% of the articles obtained their results from experiments in a controlled environment or in the field; 31.3% from technical evaluations of field data (population or sample); 10.4% from clinical cases; and 5.6% from other data sources (literature review, questionnaires, interviews, or diverse censuses) (Table 1).
Regarding the use of mathematical and/or statistical methods, 66.5% presented both methods; 16.6%, only mathematical methods without statistical support; and 16.9%, no mathematical or statistical method. Therefore, 33.6% of the published articles did not present any conclusions based on statistical analysis.
Failure to indicate the test applied can be a sign of poor analysis, neglect, or unfamiliarity with experimentation on the part of the researcher, since the diffusion of experimental techniques implies that the author should at least be able to describe what was done (LÚCIO et al., 2003).
With the features provided by statistical software and the wider diffusion of techniques for planning and analyzing experiments, researchers are increasingly able to examine experimental data properly, ensuring the reliability of results and conclusions (CONAGIN et al., 2008; GOMES, 2009). Still, in the present study, 49.0% of the articles that applied statistical tests did not state whether they used any statistical program. Among those that did, the SAS package was the most used (56.7%).
As suggested by Hokanson et al. (1987), the editorial boards of journals should adopt a minimum standard format for describing statistical techniques, including: sample size, confidence level, experimental design, technique used, and significance level.
Among the articles that applied a sampling plan (212 articles), 88.2% reported the number of samples (N), but without describing the statistical criterion used to determine it. The rest did not even report how many samples were used. For Carratore (2006), sample size alone does not determine whether a sample is of good or bad quality. More important than its size is its representativeness, i.e., the degree of similarity with the population under study. Therefore, all groups must appear in the sample, in proportions very close to those of the studied population.
In the universe analyzed, no published article presented a statistical basis for adequately determining sample size from the variability of the data and the expected confidence level, as suggested by Gomes (2009). Without a statistical method for this purpose, there is a considerable chance, for instance, of using an inadequate number of individuals, above or below that necessary for inference, which exposes the research to ethical implications.
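One common textbook way to base sample size on variability and confidence level is the normal-approximation formula n = (z * s / E)^2, where s is a prior estimate of the standard deviation and E is the tolerated margin of error for the mean. The sketch below assumes this formula and hypothetical input values; it is not necessarily the procedure recommended by Gomes (2009).

```python
import math
from statistics import NormalDist

def sample_size_for_mean(std_dev, margin_of_error, confidence=0.95):
    """Smallest n such that the normal-approximation confidence interval
    for a mean has half-width no larger than margin_of_error."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_dev / margin_of_error) ** 2)

# Hypothetical inputs: a pilot standard deviation of 10 units and a
# tolerated error of 2 units around the mean.
print(sample_size_for_mean(std_dev=10, margin_of_error=2))
```

Because the required n grows with the square of the standard deviation, a pilot estimate of variability is needed before the main experiment is sized.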
Among the articles with statistical tests, 15 (7.4%) did not report anywhere (in the text, tables, or graphs) the confidence level or error probability of the tests.
Half of the articles that obtained their data from experiments in the field or in a controlled environment (81 of 162 articles) did not state which experimental design was used. Among the experiments that reported a design, the completely randomized design was the most frequent, with 18.5%. An experimental design helps avoid biased conclusions in experiments with several simultaneous treatments, because the treatments are planned so as to prevent systematic influences of other variables on the studied effects, validating the methods and hypotheses (ALBERTON et al., 2011).
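For illustration, a completely randomized design allocates treatments to experimental units purely at random, with no blocking. The sketch below shows one way to generate such an allocation; the function name, treatment labels, and number of replicates are hypothetical.

```python
import random

def completely_randomized_design(treatments, replicates, seed=None):
    """Randomly allocate each treatment to `replicates` experimental units.

    Returns a dict mapping unit number (1-based) to treatment label.
    """
    rng = random.Random(seed)  # a seed makes the plan reproducible
    allocation = [t for t in treatments for _ in range(replicates)]
    rng.shuffle(allocation)
    return {unit: t for unit, t in enumerate(allocation, start=1)}

# Hypothetical example: three diets, four animals per diet.
plan = completely_randomized_design(["diet A", "diet B", "diet C"], 4, seed=1)
for unit, treatment in plan.items():
    print(unit, treatment)
```

Reporting the design together with the randomization procedure lets readers judge whether systematic influences were controlled.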
Among the articles that employed some statistical method, the analyses used as the main statistical support for the conclusions were the means comparison test (44.1%), non-parametric tests (16.7%), regression analysis (13.7%), and analysis of variance (9.8%). The remaining studies (15.7%) used other statistical techniques to represent and evaluate numerical results, such as correlation, descriptive statistics, frequency study, and multivariate analysis. None of the articles analyzed presented conclusions based on meta-analysis of results from other scientific studies, on mathematical modeling, or on tools for developing and validating models (Table 2). Of the total, 45.0% of the articles mentioned the use of analysis of variance (ANOVA), which according to Barbin (1993) is the most frequent statistical method in any kind of experiment. ANOVA aims at estimating the components of variance, which are of great importance in genetic improvement, whether animal or plant (BANZATTO; KRONKA, 1995).
In agreement with Carneiro (2003), several authors do not present the analysis of variance table or any measure of central tendency for the different groups, nor do they discuss whether a difference found is technically important, even when it is statistically significant. In the present study, 41.0% of the articles did not present any result and only 3.9% presented the complete ANOVA results; the other 55.1% did not report the results properly.
To verify whether the ANOVA assumptions are met, the following tests can be used: (a) Tukey's additivity test, which verifies whether the effects of the mathematical model are additive (SNEDECOR; COCHRAN, 1967); (b) the sequence test, which checks the randomness of the errors, i.e., their independence (BEAVER et al., 1974); (c) the Lilliefors test, which verifies the normality of the error distribution (CAMPOS, 1983); (d) the Bartlett test, which checks the homogeneity of variance (homoscedasticity) between treatments (STEEL; TORRIE, 1960); (e) the Shapiro-Wilk test, which examines the adherence of the results to the Gaussian normal curve (GUO et al., 2010); and, finally, (f) from the analysis of variance itself, ensuring a minimum of 12 residual degrees of freedom (GOMES, 2004).
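As an illustration of items (d) and (f) above, the following sketch computes Bartlett's statistic for homogeneity of variances in pure Python and the residual degrees of freedom of a one-way, completely randomized layout. The function names and data are hypothetical; the p-value uses the usual chi-square approximation with k - 1 degrees of freedom, and in practice an established statistical package would normally be used.

```python
import math
from statistics import variance

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    half = x / 2
    if df % 2 == 0:
        terms = (half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * sum(terms)
    terms = (half ** (j - 0.5) / math.gamma(j + 0.5)
             for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)

def bartlett_test(groups):
    """Bartlett's statistic and approximate p-value for equal variances.

    Every group needs at least two values and a non-zero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances))
    correction = 1 + (sum(1 / (n - 1) for n in sizes)
                      - 1 / (n_total - k)) / (3 * (k - 1))
    stat = max(numerator / correction, 0.0)  # guard against rounding below 0
    return stat, chi2_sf(stat, k - 1)

def residual_df(groups):
    """Residual degrees of freedom of a one-way ANOVA: N - k."""
    return sum(len(g) for g in groups) - len(groups)

# Hypothetical data: three treatments with five replicates each.
data = [[4.1, 5.0, 4.6, 5.3, 4.8], [6.2, 5.9, 7.1, 6.6, 6.0],
        [5.1, 4.4, 5.8, 5.5, 4.9]]
stat, p = bartlett_test(data)
print(f"Bartlett = {stat:.3f}, p = {p:.3f}, residual df = {residual_df(data)}")
```

A small p-value would suggest heterogeneous variances, in which case a transformation or a non-parametric alternative should be considered before running ANOVA.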
When the assumptions are not met, a non-parametric analysis must be used or the data must be transformed (LOPES; STORCK, 1995). Only 9.4% of the articles carried out some kind of mathematical transformation of the original data, and of these, only 10 articles (3.3% of the total) did so to improve the statistical evaluation of the data; the other studies used pre-defined transformations based on the literature.
Among the articles that employed ANOVA, only 18.8% described checking any prerequisites or assumptions for its appropriate application (at least normality, homoscedasticity, and minimum residual degrees of freedom). Thus, 81.2% of the articles that used ANOVA as the final basis for conclusions or as part of means comparison tests applied it incorrectly, representing 112 articles or 36.5% of the total.
A total of 114 articles (37.1% of all articles, and 55.9% of those with some statistical test) performed means comparison tests (MCT), and Tukey's test was the most frequent (47.4% of those with some MCT). According to Cardellino and Siewerdt (1992), MCT are often employed indiscriminately, impairing the conclusions drawn from the results, since they fail to provide information about intermediate treatments, such as technically significant differences and points of maximum technical and economic efficiency.
In the same way as observed for ANOVA, 86.7% of the articles performed MCT inadequately. The most common errors, besides failure to check the ANOVA prerequisites mentioned above (especially normality, which allows a mean to be used to represent the population or sample), include the use of MCT to compare nominal data, scores, counts, and percentages, which should be compared through non-parametric tests (FERREIRA, 2005). Moreover, 27 articles described the use of the t-test for comparisons among more than two means (multiple comparisons). Similarly, Petersen (1977) found that 40% of the authors surveyed used some type of means comparison test and that, among them, 40% used it in a way entirely inappropriate for the type of data. According to the same author, means comparison tests (Tukey, Bonferroni, etc.) are adequate when the treatments are unrelated levels of a qualitative factor. Bertoldo et al. (2008) evaluated the use of MCT in unifactorial and factorial experiments in scientific articles. In the unifactorial experiments, 48%, 26%, and 26% of the uses were classified as appropriate, partially appropriate, and inappropriate, respectively. In the factorial experiments, 79% of the articles were considered inappropriate in relation to the MCT, while 17% were appropriate and 4% partially appropriate.
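The problem with applying the t-test to more than two means can be illustrated with a short calculation. Assuming, as a simplification, that the pairwise comparisons are independent, the probability of at least one false positive among m comparisons at level alpha is 1 - (1 - alpha)^m. The function names below are hypothetical.

```python
import math

def familywise_error_rate(n_means, alpha=0.05):
    """Number of pairwise comparisons among n_means and the approximate
    chance of at least one false positive if each is tested at alpha.

    Assumes independent comparisons, which is only an approximation.
    """
    n_comparisons = math.comb(n_means, 2)
    return n_comparisons, 1 - (1 - alpha) ** n_comparisons

def bonferroni_alpha(n_means, alpha=0.05):
    """Per-comparison level that keeps the familywise rate at or below alpha."""
    return alpha / math.comb(n_means, 2)

for k in (2, 3, 5, 8):
    m, fwer = familywise_error_rate(k)
    print(f"{k} means: {m} comparisons, familywise error about {fwer:.2f}")
```

This is why procedures that control the familywise error rate, such as Tukey's test or a Bonferroni correction, are preferred over repeated t-tests when several means are compared.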
An inadequate choice of MCT results in incomplete or misleading statements. For instance, when the treatments or factors are qualitative, the MCT can be adequately applied; when the treatments are quantitative, however, regression analysis with different models should be used (BANZATTO; KRONKA, 1995). In this study, 35 articles (11.4%) used regression analysis to compare results from quantitative treatments, and 14 publications (4.6%) used the MCT to compare quantitative groups.
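When treatments are quantitative levels (e.g., doses), a fitted response curve can also locate the level of maximum response, which an MCT cannot do. The sketch below fits a quadratic model by ordinary least squares in pure Python; the dose and response values are invented for illustration, and the quadratic form is an assumption, not a model taken from the articles reviewed.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(x, y):
    """Least-squares coefficients (b0, b1, b2) of y = b0 + b1*x + b2*x**2,
    obtained by solving the normal equations with Cramer's rule."""
    sx = [sum(xi ** p for xi in x) for p in range(5)]   # sums of x^0..x^4
    sxy = [sum(yi * xi ** p for xi, yi in zip(x, y)) for p in range(3)]
    normal = [[sx[i + j] for j in range(3)] for i in range(3)]
    d = det3(normal)
    coeffs = []
    for col in range(3):
        m = [row[:] for row in normal]
        for r in range(3):
            m[r][col] = sxy[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)

def level_at_maximum(b1, b2):
    """Treatment level at which the fitted quadratic reaches its maximum."""
    if b2 >= 0:
        raise ValueError("fitted curve is not concave; no maximum exists")
    return -b1 / (2 * b2)

# Hypothetical dose-response data (NOT from the reviewed articles).
doses = [0, 1, 2, 3, 4, 5]
responses = [1.0, 6.2, 8.9, 10.1, 9.0, 5.8]
b0, b1, b2 = fit_quadratic(doses, responses)
print(f"y = {b0:.2f} + {b1:.2f}x + {b2:.2f}x^2, "
      f"maximum near x = {level_at_maximum(b1, b2):.2f}")
```

At least three distinct treatment levels are needed for the quadratic fit, and the adequacy of the model should still be checked (e.g., through lack of fit and residuals).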
Over the eleven years of publication evaluated herein, 33.6% of the articles presented conclusions without any statistical support, 34.2% with statistical support but with inappropriate application, and 32.2% with conclusions based on appropriate use of statistics. In addition, of those that did not use statistics, 37.9% could clearly have subjected their data to statistical analysis, which raises the number of improperly obtained conclusions to 46.9% (144) of the articles published in the period. Evaluating the results per year with the chi-square test, there was no significant difference (p > 0.05) among the eleven years in the proportion of studies with conclusions (a) without statistical support, (b) with statistical support but inadequate application, or (c) with adequate statistical support. When the same test was applied to contrast the last four years of publication (2007 to 2010) with the earlier years (2000 to 2006), there was a significant increase (p < 0.05) in the proportion of articles without statistical analysis, from 27.6% to 43.8% in the latest publications.
For the same contrast between periods, no significant change (p > 0.05) was found in the proportion of articles with adequate or inadequate use of statistical methods. Nevertheless, according to the chi-square test, the share of studies with conclusions lacking statistical support whose data could have been evaluated statistically increased significantly (p < 0.05) in the last four years, from 32.6% (2000 to 2006) to 47.8% (2007 to 2010).
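The 46.9% share of improperly supported conclusions reported above can be retraced from the published percentages. The sketch below is a reconstruction under the assumption that article counts can be recovered by rounding each percentage of the 307 articles; the function and parameter names are hypothetical.

```python
def improper_conclusions(total, pct_no_stats, pct_inadequate,
                         pct_could_have_tested):
    """Count and percentage of articles with improperly supported conclusions.

    Counts are reconstructed by rounding the reported percentages, which
    is an assumption; the original per-article records are not used here.
    """
    no_stats = round(total * pct_no_stats / 100)
    inadequate = round(total * pct_inadequate / 100)
    # Articles without statistics whose data could have been tested.
    could_have_tested = round(no_stats * pct_could_have_tested / 100)
    improper = inadequate + could_have_tested
    return improper, round(100 * improper / total, 1)

count, share = improper_conclusions(307, 33.6, 34.2, 37.9)
print(count, share)
```

With the percentages reported above, the reconstruction gives 144 articles, i.e., 46.9% of 307.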
According to Lee (2010), 51% of the articles published in biomedical journals used incorrect statistical methodology, but this percentage may be even higher, since 16% of the articles did not specify the methodology and could have used inadequate procedures. Among papers in this same area that employed statistics, the percentage that used improper methods increased from 22% to 46% between 1985 and 1995 (ZHANG, 1998).
The results of the present study indicate the need for greater attention from researchers, reviewers, and editors to the statistical methods applied. With adequate use of these methods, data from experiments and other technical evaluations can be evaluated properly, lending greater reliability to results and conclusions (GLICKMAN et al., 2010). Ultimately, the goal of their use is to ensure the absence of ambiguity in the empirical reference of the concepts used by researchers and to give readers confidence when drawing inferences from published results.

Conclusion
There is an evident lack of careful verification of the suitability of the statistical methods used for the types of treatments studied and the data obtained.
It is therefore necessary that journal editors require the correct application of statistics, since a large share of published articles base their conclusions on inadequate techniques or simply lack the required statistical scrutiny.
Researchers should know enough statistics to analyze their studies correctly, especially when simpler tests suffice, and to recognize the situations in which they need the collaboration of a statistician.