Type I error in multiple comparison tests in analysis of variance

ABSTRACT. In a hypothesis test, a researcher initially fixes a type I error rate, that is, the probability of rejecting the null hypothesis given that it is true. In the case of means tests, it is important to present a type I error that is equal to the nominal pre-fixed level, such that this error remains unchanged across various scenarios, including the number of treatments, number of repetitions, and coefficient of variation. The purpose of this study is to analyse and compare the following multiple comparison tests for the control of both conditional and unconditional type I error rates, depending on a significant F-test in the analysis of variance: Tukey, Duncan, Fisher’s least significant difference, Student-Newman-Keuls (SNK), and Scheffé. As an application, we present a motivation study and develop a simulation study using the Monte Carlo method for a total of 64 scenarios. In each simulated scenario, we estimate the comparison-wise and experiment-wise error rates, conditional and unconditional on a significant result of the overall F-test of analysis of variance for each of the five multiple comparison tests evaluated. The results indicate that the application of the means tests based only on the significance of the F-test should be considered when determining the error rates, as this can change them. In addition, we find that Fisher’s test controls for the comparison-wise error rate, the Tukey and SNK tests control for the experiment-wise error rate, and the Duncan and Fisher tests control for the conditional experiment-wise error rate. Scheffé’s test does not control for any of the error rates considered.


Introduction
In agricultural experiments, a common problem arises when comparing treatments of interest to determine whether there is a difference between them. The most common solution to this problem lies in the application of the analysis of variance (ANOVA) (Girardi, Cargnelutti Filho, & Storck, 2009).
The overall F-test in ANOVA checks the hypothesis of equality of the population means of the treatments. If the F-test is significant, then a means comparison test is performed to investigate possible differences between pairs of specific means or a linear combination of them (Saville, 2014).
One of the dilemmas involved in the means tests is their conditional application to a significant F-test. According to Cardellino and Siewerdt (1992), this is a controversial question and should be investigated further. Rodrigues, Piedade, and Lara (2016), for example, noticed divergent results between the overall F-test and the means tests evaluated in their simulation study. In this motivational study, we present the agronomic experiment of Henrique and Laca-Buendía (2010), which compares five cultivars and a new genotype of cotton, and show divergent results between the overall F-test and certain means tests commonly used in agricultural research. Nevertheless, many authors recommend applying means tests only upon a significant result of the F-test in ANOVA. Therefore, many questions remain to be answered in this field of study. It is not possible, however, to dissociate this study from the errors that can occur in a hypothesis test. This is because, when the hypothesis for a mean contrast is analysed, the test, whether or not it is applied only after a significant result of the overall F-test, exhibits the probabilities of type I and type II errors, where the type I error rate can be of the comparison-wise or experiment-wise type (Ramos & Vieira, 2014). The comparison-wise error rate is the long-run proportion of erroneous inferences among all comparisons made; the experiment-wise error rate is the long-run proportion of experiments with at least one erroneous inference among all experiments conducted (Boardman & Moffitt, 1971).
Several studies have been conducted to evaluate the means tests with respect to type I error rate control and to propose modifications to the tests aimed at controlling this rate (Biase & Ferreira, 2011; Souza, Lira Junior, & Ferreira, 2012; Gonçalves, Ramos, & Avelar, 2015). However, the analysis of these concepts in association with the significance of the F-test requires additional investigation.
In this context, the aim of this study is to analyse and compare the Tukey, Duncan, Fisher's least significant difference (Fisher's LSD), Student-Newman-Keuls (SNK), and Scheffé tests, which are used for pair-wise comparisons between means, with respect to the control of type I error rates, both conditional and unconditional on a significant result of the overall ANOVA F-test.

Motivational study
We present an experiment conducted by Henrique and Laca-Buendía (2010), whose aim was to compare five cultivars and a new genotype of cotton (Gossypium hirsutum L. r. latifolium Hutch). The experiment was conducted in Uberaba, Minas Gerais State, Brazil, located at longitude 47°57'22" WGR, latitude 19°44'6.82" S, and an altitude of 775 m.
The experiment was carried out in a randomised block design with six treatments and four replicates. The plots consisted of four 5 m lines with 0.7 m spacing between the lines. The two central lines were considered the useful area of 3.5 m²; the other two lines, one on each side, were the borders, with each line comprising a total of 10 plants. In agricultural experiments, it is quite common to use four repetitions, since the plots cover larger areas and more than one 'individual' is used per plot.
In the experiment, the following varieties were compared: Delta Opal (Delta and Pine), Delta Penta (Delta and Pine), BRS-Cedro (EMBRAPA), IAC-25 (IAC), EPAMIG Precoce I, and the progeny IAC-06/191 (IAC). The height of the first productive branch (average distance from the soil to the first branch bearing bolls, in centimetres) and the final stand (total number of plants at harvest time) were among the evaluated traits.
In the context of the variable height of the first productive branch, the overall ANOVA F-test is significant at the 5% level, but the Scheffé test shows no difference between the treatment means at the same level of significance (Table 1). For the variable final stand, the ANOVA F-test is not significant, but Duncan's and Fisher's LSD tests show differences between some of the treatment means (Table 2). These results serve as the basis for establishing scenarios to study the type I error rates presented here, since the means tests can be applied whether or not the F-test is significant. A completely randomised design was used to facilitate the simulation process, although the results can be applied to any other type of experimental design.

Simulation study
A total of 128,000 experiments were simulated using the Monte Carlo method, with 2,000 experiments for each of the 64 scenarios formed by the combination of the following factors: 3, 5, 7, or 9 treatments; 3, 4, 10, or 20 replicates; and a coefficient of variation (CV) of 1, 5, 10, or 20%, without considering the treatment effect. The experiments were simulated using a completely randomised design according to the model y_ij = µ + τ_i + ε_ij, where y_ij represents the simulated value of the response obtained with the i-th treatment in its j-th repetition, µ is the overall mean, arbitrarily set at 100, τ_i is the fixed effect of the i-th treatment (considered null), and ε_ij is the random error, generated independently from a normal distribution with zero mean and a standard deviation (σ) varying according to the desired CV. In all simulated scenarios, the analyses were conducted with the same random seed, so that possible differences would arise not from the random error of the simulation process but from differences between the tests. Moreover, the nominal significance level adopted in all cases was 5%. To simulate the experimental data and perform the statistical analyses, an algorithm was developed using the R software (R Core Team, 2020).
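The original study implemented this simulation in R; the sketch below illustrates the same null data-generating model in Python with NumPy (the function name `simulate_crd` and its signature are our own, not from the paper).

```python
import numpy as np

def simulate_crd(a, r, cv, mu=100.0, rng=None):
    """Simulate one completely randomised design (CRD) experiment under the
    null model y_ij = mu + e_ij: `a` treatments, `r` replicates, no treatment
    effect, and errors e_ij ~ N(0, sigma^2) with sigma chosen so that the
    coefficient of variation (100 * sigma / mu) equals `cv` (in percent)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = mu * cv / 100.0
    return mu + sigma * rng.normal(size=(a, r))  # a x r matrix of responses

# One experiment from the demonstrative scenario a = 3, r = 3, CV = 1%,
# with a fixed seed, mirroring the paper's use of a common random seed
# so that differences across tests are not due to simulation noise.
y = simulate_crd(a=3, r=3, cv=1, rng=np.random.default_rng(42))
```

Repeating this draw 2,000 times per scenario, and applying the five multiple comparison tests to each draw, reproduces the structure of the simulation described above.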
Consequently, for each simulated scenario, the type I error rates were estimated per comparison and per experiment for each of the five tests (Tukey, Duncan, Fisher's LSD, SNK, and Scheffé). Furthermore, these error rates were considered under two different approaches: (i) the multiple comparison procedure was applied regardless of the overall F-test result, and (ii) it was applied only when the F-test was significant.
The comparison-wise error rate (α_c) is defined as the ratio of the number of erroneous inferences (concluding µ_i ≠ µ_i' when µ_i = µ_i') to the total number of comparisons performed. Thus, taking the demonstrative scenario of a = 3, r = 3, and CV = 1%, the unconditional comparison-wise error rate can be estimated from the ratio of the total number of erroneous inferences to the total number of inferences (in this case, 2,000 experiments × 3 contrasts per experiment = 6,000 contrasts).
In the case where the means tests are applied only if the overall ANOVA F-test is significant, the conditional comparison-wise error rate (α_1) can be estimated empirically from the total number of type I errors observed in the experiments that presented a significant F-test and the total number of inferences made. Taking the demonstrative scenario of a = 3, r = 3, and CV = 1%, and considering that, out of the 2,000 experiments simulated for this scenario, only 100 showed a significant result for the overall F-test, this error rate can be estimated by the ratio of the total number of erroneous inferences made in those 100 experiments to the total number of inferences (6,000 contrasts).
A second way to estimate the conditional comparison-wise error rate (α_2) is to take the total number of erroneous inferences within the experiments with a significant result for the overall F-test and divide it by the total number of inferences made within these experiments. Considering the same scenario where a = 3, r = 3, and CV = 1%, and the same 100 experiments with a significant overall F-test, this rate can be estimated by the ratio of the total number of erroneous inferences made in the 100 experiments to the total number of inferences made within these experiments (in this case, 100 experiments × 3 contrasts per experiment = 300 contrasts).
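Under the counting scheme just described, the three comparison-wise rates can be estimated from two per-experiment records: the number of erroneous inferences and the outcome of the overall F-test. A minimal Python sketch (the function and argument names are our own, for illustration only):

```python
def comparisonwise_rates(errors, n_contrasts, f_significant):
    """Estimate alpha_c, alpha_1, and alpha_2 from simulation records.

    errors[k]        -- number of type I errors observed in experiment k
    n_contrasts      -- contrasts tested per experiment (3 when a = 3)
    f_significant[k] -- True if the overall F-test of experiment k was significant
    """
    n_exp = len(errors)
    total = n_exp * n_contrasts                     # all inferences, all experiments
    err_sig = sum(e for e, s in zip(errors, f_significant) if s)
    n_sig = sum(f_significant)                      # experiments with significant F
    alpha_c = sum(errors) / total                   # unconditional rate
    alpha_1 = err_sig / total                       # errors in significant-F experiments
                                                    # over ALL inferences
    alpha_2 = err_sig / (n_sig * n_contrasts) if n_sig else 0.0
    return alpha_c, alpha_1, alpha_2                # alpha_2: over inferences WITHIN them
```

Because α_1 and α_2 share the same numerator but α_2 has the smaller denominator, α_2 ≥ α_1 always holds, which matches the behaviour reported in the results.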
Accordingly, the experiment-wise error rate (α_e) is defined as the ratio of the number of experiments with at least one erroneous inference (concluding µ_i ≠ µ_i' when µ_i = µ_i') to the total number of experiments. Thus, considering the scenario where a = 3, r = 3, and CV = 1%, the unconditional experiment-wise error rate can be estimated by the ratio of the number of experiments with at least one erroneous inference among the three contrasts tested to the total number of experiments (in this case, 2,000).
The conditional experiment-wise error rate (α_3) can be estimated by taking the total number of experiments that presented both a significant overall F-test and at least one comparison resulting in a type I error, divided by the total number of experiments. Considering that, for the scenario where a = 3, r = 3, and CV = 1%, only 100 of the 2,000 simulated experiments presented a significant result for the overall F-test, this error rate can be estimated by the ratio of the number of experiments (out of those 100) that presented at least one comparison resulting in a type I error to the total number of experiments (2,000).
Moreover, the same rate can be estimated based only on the experiments that presented a significant result in the overall F-test. This rate (α_4) is calculated by dividing the number of experiments that presented both a significant F-test and at least one comparison resulting in a type I error by the total number of experiments with a significant F-test. For the scenario with a = 3, r = 3, and CV = 1%, considering that, out of the 2,000 experiments simulated, only 100 presented a significant result for the overall F-test, this rate can be estimated by dividing the number of experiments (out of the 100) that presented at least one comparison resulting in a type I error by the total number of experiments with a significant overall F-test (in this case, 100).
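The three experiment-wise rates follow the same counting logic, now at the level of whole experiments rather than individual contrasts; a sketch consistent with the previous one (names again are our own):

```python
def experimentwise_rates(errors, f_significant):
    """Estimate alpha_e, alpha_3, and alpha_4 from simulation records.

    errors[k]        -- number of type I errors observed in experiment k
    f_significant[k] -- True if the overall F-test of experiment k was significant
    """
    n_exp = len(errors)
    # Experiments with at least one erroneous inference...
    any_error = [e > 0 for e in errors]
    # ...and, among those, the ones whose overall F-test was also significant
    both = sum(1 for a, s in zip(any_error, f_significant) if a and s)
    n_sig = sum(f_significant)
    alpha_e = sum(any_error) / n_exp          # unconditional; denominator = all experiments
    alpha_3 = both / n_exp                    # conditional; denominator = all experiments
    alpha_4 = both / n_sig if n_sig else 0.0  # denominator = significant-F experiments only
    return alpha_e, alpha_3, alpha_4
```

As with the comparison-wise rates, α_3 and α_4 share the same numerator, so α_4 ≥ α_3, with the gap shrinking as more experiments yield a significant F-test.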
To verify whether each of the rates differed from the established nominal significance level (α = 5%), a lower limit of 0.038 and an upper limit of 0.063 were used, calculated from a 95% confidence interval (CI) for the proportion p̂ = 0.05, expressed as p̂ ± z_(α/2) √(p̂(1 − p̂)/N), where z_(α/2) is the quantile of the standard normal distribution at the α level of significance and N is the number of simulated experiments. Thus, the rates within this interval were not considered to differ from the established nominal level.

Results and discussion
In the 64 scenarios formed by the combination of the number of treatments (a = 3, 5, 7, and 9), the number of repetitions (r = 3, 4, 10, and 20), and the CV (1%, 5%, 10%, and 20%), the error rates do not vary greatly with the number of repetitions or the CV (Tables 3 to 6). The same results were noted by Girardi et al. (2009), who presented the comparison-wise and experiment-wise error rates for 80 scenarios formed by varying the number of treatments, the number of repetitions, and the CV of the experiments.
In all the simulated scenarios, the comparison-wise error rates are lower than the experiment-wise error rates (Tables 3 to 6) for the five multiple comparison tests evaluated, an expected result according to Girardi et al. (2009); equality between the two rates would be obtained only if all contrasts were significant in every experiment with at least one significant contrast.
Regarding the comparison-wise error rates α_c, the behaviour of the Tukey, SNK, and Scheffé tests is similar, since the estimates for this rate lie below the lower limit of the calculated 95% CI (0.0375) in all simulated scenarios (Tables 3 to 6). In general, an increase in the number of treatments causes a decrease in the error rate for these tests.
Further, the Duncan test presents estimates for α_c below the lower limit of the 95% CI in most scenarios. However, for CV = 1%, the test controls this rate in most scenarios where a = 3 and a = 5, while for CV = 5%, 10%, and 20%, it controls this rate in the scenarios with a = 3. As with the Tukey, SNK, and Scheffé tests, an increase in the number of treatments causes a marked decrease in the error rate for the Duncan test.
For Fisher's LSD test, the estimates obtained for the comparison-wise error rate α_c always remain within the calculated 95% CI (Tables 3 to 6). In this case, varying the number of treatments does not cause significant changes in the results.
Thus, with respect to α_c, Fisher's LSD test proves to be the only one that controls this error rate regardless of the number of treatments, the number of repetitions, and the CV of the experiments; it is therefore the most robust test in this case. The Duncan test, in turn, lies in an intermediate situation; it controls this rate only in the scenarios with a small number of treatments and is conservative otherwise. The Tukey, SNK, and Scheffé tests exhibit the worst performance, being conservative in all simulated scenarios; among them, the Scheffé test is the most conservative, followed by the Tukey and SNK tests.
In the context of the experiment-wise error rates α_e, the Tukey and SNK tests exhibit similar behaviour, always presenting estimates for this rate within the 95% CI (Tables 3 to 6). In both cases, varying the number of treatments does not significantly influence the results.
•Type I error rates below the lower limit of the 95% confidence interval (CI) (0.0375) for the empirical proportion of this rate. ••Type I error rates above the upper limit of the 95% CI (0.0625) for the empirical proportion of this rate.
The Scheffé test presents estimates for α_e that mostly lie below the lower limit of 0.0375, except in two scenarios with a = 3 for CV = 1% and CV = 5% (Tables 3 and 4) and in three scenarios with a = 3 for CV = 10% and CV = 20% (Tables 5 and 6), in which the test controls this rate. An increase in the number of treatments causes a marked decrease in the error rate for this test.
In general, in the context of α_e, the Tukey and SNK tests are the only ones that control this rate and can therefore be considered robust under all experimental conditions, regardless of the number of treatments, the number of repetitions, and the CV. The Scheffé test, in turn, is in an intermediate situation, since it controls the experiment-wise error rate only in scenarios with three treatments, being conservative in the other cases. The Duncan and Fisher's LSD tests show the worst performance, being liberal in all simulated scenarios, with Fisher's LSD test the most liberal among them.
According to Girardi et al. (2009), equality of the comparison-wise and experiment-wise error rates at the established level of significance would be ideal for a multiple comparison test. However, according to Perecin and Barbosa (1988), a test that controls the comparison-wise error rate can become very liberal when applied to the entire experiment, while a test that controls the experiment-wise error rate can become conservative in a single comparison.
Indeed, the Tukey and SNK tests prove to be conservative in controlling the comparison-wise error rate, while controlling the experiment-wise error rate at the nominal significance level. Similar behaviour of these tests was observed by Boardman and Moffitt (1971), Bernhardson (1975), and Girardi et al. (2009). Fisher's LSD test, which controls the comparison-wise error rate, proves to be liberal in controlling the experiment-wise error rate, as observed by Boardman and Moffitt (1971), Bernhardson (1975), Perecin and Barbosa (1988), and Girardi et al. (2009).
Regarding the conditional error rates, the comparison-wise rates α_c are always equal to the conditional rates α_1 for the Scheffé test, while the rates α_c tend to be slightly higher than α_1 for each of the other tests considered (Tables 3 to 6). Meanwhile, the conditional rates α_2 are always higher than the rates α_c for the five tests. The differences between these last two error rates are large when the number of treatments is small and decrease as the number of treatments increases.
Therefore, none of the means tests controls the conditional comparison-wise error rates α_1 or α_2. For α_1, all the tests present estimates below the lower limit of the calculated 95% CI, being conservative in controlling this rate. For α_2, the tests in general show liberal behaviour, with estimates above the upper limit of the CI, except in some scenarios with a = 9 for the Tukey and SNK tests and in some scenarios with a = 7 and all scenarios with a = 9 for the Scheffé test, in which the tests control this error rate.
The unconditional experiment-wise error rates α_e are always equal to the conditional rates α_3 for the Scheffé test, whereas for each of the other tests the rates α_e are slightly higher than the rates α_3 (Tables 3 to 6). Further, the rates α_e are always lower than the conditional rates α_4, and although the differences between them are considerable, they decrease as the number of treatments increases. For the Scheffé test, however, the differences between these rates are not large in scenarios with a high number of treatments.
Regarding the control of the conditional experiment-wise error rates, we observe that the Duncan and Fisher's LSD tests control the error rate α_3 regardless of the number of treatments. The Tukey and SNK tests, in turn, control this rate in all the scenarios with a = 3 and with a = 3 or a = 5, respectively; in the other scenarios, where they do not control this rate, both tests are conservative. The Scheffé test proves to be conservative in most scenarios, controlling this rate only in certain scenarios with a = 3. For α_4, all the tests show liberal behaviour, except the Scheffé test in certain scenarios with a = 9, where this rate is controlled.
Figures 1 and 2 summarise the tendency of each rate for the tests considered as the number of treatments increases. Only the variation in the number of treatments is considered here, since this is the only factor that significantly altered the results; we therefore fix r = 10, CV = 10%, and α = 5%. The lines between the dots represent the behaviour of the rates as the number of treatments increases. For simplicity, only the case with r = 10 and CV = 10% is reported; the results for the other simulated scenarios are similar.
Based on the above results, it can be inferred that conditioning the multiple comparison tests on the significance of the overall ANOVA F-test can change their type I error rates. The Duncan and Fisher's LSD tests do not control the experiment-wise error rate α_e, and an increase in the number of treatments leads to an increase in this rate. However, considering the conditional experiment-wise error rate α_3, we find that these tests do control this rate, regardless of the number of treatments, because the nominal significance level of the ANOVA F-test determines an upper limit for the α_3 error rates (Bernhardson, 1975). In general, Fisher's LSD test controls the comparison-wise error rate α_c, whereas the Tukey and SNK tests control the experiment-wise error rate α_e. The Duncan and Fisher's LSD tests control the conditional experiment-wise error rate α_3. The Scheffé test does not control any of the error rates considered, possibly because it is designed for all possible contrasts and not only for pair-wise contrasts of means (Boardman & Moffitt, 1971).

Conclusion
Since each multiple comparison test controls a different error rate, the choice of test must depend on which error rate is intended to be controlled. If the decision is to control the comparison-wise error rate α_c, then Fisher's LSD test is the most suitable. If the decision is to control the experiment-wise error rate α_e, then the Tukey and SNK tests are recommended. If the intention is to control the conditional experiment-wise error rate α_3, then the Duncan and Fisher's LSD tests can be used. The type I error rates, in general, did not show significant changes with variation in the number of repetitions or the CV, but did change with variation in the number of treatments. When choosing the most applicable test, the power function should be considered in addition to the type I error rate: a test with good performance should maintain coverage of the type I error rate while simultaneously having high power. The study of the power function is extensive in its own right and should therefore be developed in future work.

Figure 1 .
Figure 1. (A) Unconditional comparison-wise error rates α_c; (B) conditional comparison-wise error rates α_1; and (C) conditional comparison-wise error rates α_2 for the various multiple comparison tests with 10 replications, a nominal significance level of 5%, and a coefficient of variation of 10%, according to the variation in the number of treatments.

Figure 2 .
Figure 2. (A) Unconditional experiment-wise error rates α_e; (B) conditional experiment-wise error rates α_3; and (C) conditional experiment-wise error rates α_4 for the various multiple comparison tests with 10 replications, a nominal significance level of 5%, and a coefficient of variation of 10%, according to the variation in the number of treatments.

Table 1 .
Multiple comparisons applied to the data of the height of the first productive branch, in centimetres, for five cultivars and a new genotype of cotton.
*Significant at the 5% level; means followed by the same letter do not differ from each other at a significance level of 5%. The assumptions of normality and homoscedasticity were verified using the Shapiro-Wilk test (p-value = 0.3399) and the Bartlett test (p-value = 0.2617), respectively.

Table 2 .
Multiple comparisons applied to the data of the final stand of five cultivars and a new genotype of cotton.
ns Not significant at the 5% level; means followed by the same letter do not differ from each other at a significance level of 5%. The assumptions of normality and homoscedasticity were verified using the Shapiro-Wilk test (p-value = 0.1575) and the Bartlett test (p-value = 0.6062), respectively.

Table 3 .
Type I error rates of multiple comparison tests according to the number of treatments (a) and the number of repetitions of treatments (r) with coefficient of variation = 1%.

Table 4 .
Type I error rates of multiple comparison tests according to the number of treatments (a) and the number of repetitions of treatments (r) with coefficient of variation = 5%.

Table 5 .
Type I error rates of multiple comparison tests according to the number of treatments (a) and the number of repetitions of treatments (r) with coefficient of variation = 10%.

Table 6 .
Type I error rates of multiple comparison tests according to the number of treatments (a) and the number of repetitions of treatments (r) with coefficient of variation = 20%.