Improvement of the Wald method applied to the evaluation of zero-inflated binomial linear functions

The Wald method is based on a statistic that relies on the asymptotic normal approximation. The method yields coverage probabilities for interval estimates that are inconsistent with the nominal confidence level, mainly in small samples, and this is noticeable in linear functions of binomial proportions. The present analysis improves the method as used for inference on binomial linear functions by taking zero-inflated samples into account. The improvement was assessed by Monte Carlo simulation under different scenarios. Results show that the proposed improvement is recommended in situations in which the sample proportions are close to 0.5 and produce the maximum variance of the binomial proportions that compose the linear function.


Introduction
The Wald method is highly relevant among the procedures known in the literature for inference on binomial proportions. The method, widely used to compare two binomial proportions, is essentially asymptotic: the distribution of the estimator is approximately normal. Because of this approximation, numerous studies show that the method has shortcomings with regard to coverage probability and to its use in small samples. Alternative methods have been proposed to correct this deficiency. An improvement to the Wald method, proposed by Agresti and Coull (1998), consists briefly of adding four pseudo-observations, two successes and two failures, to the expression of the proportion estimator. This procedure is known as the 'add-4 method'. However, the more general problem of interval estimation for a linear function of binomial proportions mentioned by Price and Bonett (2004), which includes pairwise comparisons, complex contrasts, interaction effects and simple main effects (BONETT; WOODWARD, 1987), involves factors that influence the coverage probability estimate.
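The difference between the plain Wald interval and the add-4 adjustment for a single proportion can be illustrated with a minimal sketch. The function names are hypothetical, and the intervals are deliberately not truncated to [0, 1].

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wald_interval(y, n, z=Z_95):
    """Plain Wald interval for one binomial proportion (y successes in n trials)."""
    p = y / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def add4_interval(y, n, z=Z_95):
    """Add-4 interval: the Wald formula after adding 2 successes and 2 failures."""
    return wald_interval(y + 2, n + 4, z)
```

With y = 0 successes in n = 10 trials, the plain Wald interval collapses to a single point at zero, whereas the add-4 interval keeps a positive length. This is one way to see why the adjustment helps in small samples.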
Studying the Wald method and comparing it with other methods using the bootstrap approach, Carari et al. (2010) concluded that the Wald method yields coverage probabilities below the nominal confidence coefficient, which jeopardizes its practical application to small samples. With regard to the add-4 method, the study showed that it stood out by producing adequate coverage probabilities and shorter intervals.
Acta Scientiarum. Technology, Maringá, v. 37, n. 1, p. 47-54, Jan.-Mar., 2015
The Wald method has also been used for linear functions that involve binomial proportions, also known as binomial families. A generalization of the method with this approach is stated by Price and Bonett (2004) as a confidence interval, expression (1), for the parameter, where $n_i$ is the sample size for the i-th binomial population; $\delta_i$ is a known coefficient specified by the researcher; and $q$ is the number of coefficients involved in the function. Even with this generalization, the Wald method still presents the flaws mentioned, and in this context alternative methods have emerged. More details may be found in Price and Bonett (2004), Tebbs and Roths (2008) and Cirillo et al. (2009).
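The interval expression itself is not reproduced in this text. The sketch below assumes the usual Wald form for a linear function of independent binomial proportions: the point estimate is the sum of delta_i * p_i, plus or minus z times the square root of the sum of delta_i^2 * p_i * (1 - p_i) / n_i, with p_i = y_i / n_i. The function name and the example coefficient vector are illustrative assumptions, not taken from the article.

```python
import math


def wald_linear_interval(y, n, delta, z=1.959963984540054):
    """Wald interval for F = sum(delta_i * pi_i) from q independent binomial samples.

    y[i] is the number of successes out of n[i] trials in population i;
    delta[i] is the known coefficient of population i.
    """
    if not (len(y) == len(n) == len(delta)):
        raise ValueError("y, n and delta must all have length q")
    p_hat = [yi / ni for yi, ni in zip(y, n)]
    estimate = sum(d * p for d, p in zip(delta, p_hat))
    variance = sum(d * d * p * (1 - p) / ni for d, p, ni in zip(delta, p_hat, n))
    half = z * math.sqrt(variance)
    return estimate - half, estimate + half
```

For example, `wald_linear_interval([50, 50], [100, 100], [1, -1])` gives the familiar Wald interval for a difference of two proportions, centred at zero.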
It is worth mentioning that the Wald method, whether applied to the comparison of two binomial proportions or generalized to binomial linear functions as put forth in the literature, does not consider zero-inflated binomial (ZIB) samples. In this case, the use of these methods would certainly exacerbate the deficiencies mentioned previously with regard to coverage probability and application to small samples. Silva and Cirillo (2010) warn that, even assuming the adequacy of the model, some zeros may be considered outliers, and different estimation methods are sensitive to this anomaly.
Consequently, robust estimation methods are needed that consider the presence of divergent data and provide a coherent estimate of the required parameter. Given this problem, methods that deal with the effect of outliers on estimates are still a focus of research. Andrade et al. (2014) proposed a bootstrap algorithm that examines the effect of divergent and/or influential observations on estimates for non-linear parameter models.
Keeping the focus on count data, Silva et al. (2012) studied the zero-inflation effect on a Poisson model as a function of sample size and of different parametric values, with inference based on a zero-inflated Poisson (ZIP) model. The authors concluded that discriminating between the ZIP and Poisson models through a score test is recommended for sample sizes greater than n = 40 in samples with a high proportion of null values. Wood et al. (2005) proposed two alternatives to estimate the probability of success in binomial samples contaminated with divergent observations. These alternatives refer to two estimators that differ in using the arithmetic average or the rationalized means of the observed proportions.
After comparing the variances of the estimators, the authors concluded that the recommended estimator depends on the situation, as characterized by the distribution of proportions and the number of trials (n) performed.
In view of the scarcity of methods robust to zero inflation for estimating binomial linear functions, the present research proposes an improvement of the Wald method applied to the interval estimation of binomial linear functions. The proposal makes the method robust to zero-inflated binomial samples by replacing the maximum likelihood estimates with robust estimates. Several scenarios with different parametric configurations are assessed via Monte Carlo simulation to validate the method.

Material and methods
Following the objectives proposed, the method was performed in two steps, specified in sections 2.1 and 2.2, with details below.
Simulation of zero-inflated binomial samples.
Using Monte Carlo simulation techniques, the zero-inflated binomial samples were generated according to the ZIB model (RUCKSTUHL; WELSH, 2001), a mixture of two components: one component generates a zero with probability γ, while the other is a binomial distribution that occurs with probability (1-γ). The ZIB model is thus defined by expression (2), together with its variance, where γ is the probability of a zero occurrence and m is the number of Bernoulli experiments. Using the model given in (2), with m = 100 Bernoulli experiments fixed for samples of size n, the parametric values assumed in the Monte Carlo simulation process are described in Table 1.
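The mixture described above can be sketched in a few lines. Expression (2) and the variance formula are not reproduced in this text, so the code assumes the standard zero-inflated binomial form: a zero with probability γ, otherwise a Binomial(m, π) count. The variance used is the one that follows from the moments of that mixture. All function names are hypothetical.

```python
import math
import random


def zib_pmf(y, m, pi, gamma):
    """P(Y = y) under the assumed mixture: zero w.p. gamma, Binomial(m, pi) otherwise."""
    binom = math.comb(m, y) * pi**y * (1 - pi) ** (m - y)
    return (gamma if y == 0 else 0.0) + (1 - gamma) * binom


def zib_mean(m, pi, gamma):
    """Mean of the mixture: (1 - gamma) * m * pi."""
    return (1 - gamma) * m * pi


def zib_variance(m, pi, gamma):
    """Variance from the mixture moments: (1-gamma)*m*pi*(1 - pi + gamma*m*pi)."""
    return (1 - gamma) * m * pi * (1 - pi + gamma * m * pi)


def rzib(n, m, pi, gamma, rng=random):
    """Draw n zero-inflated binomial counts, each based on m Bernoulli trials."""
    sample = []
    for _ in range(n):
        if rng.random() < gamma:
            sample.append(0)  # structural zero from the inflation component
        else:
            sample.append(sum(rng.random() < pi for _ in range(m)))
    return sample
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible, for example `rzib(50, 100, 0.5, 0.2, random.Random(1))`.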
Table 1. Parametric values used to generate zero-inflated binomial samples.
Keeping the parametric configurations, the estimator robust to zero-inflated binomial proportions is denoted by $\hat{\pi}_{zib}$. This estimator was obtained as a combination of estimators found in Ruckstuhl and Welsh (2001) and Silva and Cirillo (2010).
where $\hat{\pi}_{mle}$ is the maximum likelihood estimator of π given in (4). The expression presented in (3) is based on the likelihood disparity of E-estimators (RUCKSTUHL; WELSH, 2001), and $\rho_s(x)$ represents a function that minimizes the disparity,
where $c_1$ and $c_2$ are affinity constants.
The argument of the function is fixed, where $p_n(y)$ is the probability under a binomial distribution evaluated at the estimate of π given by (4). The values of s are set to 1 and 2, which defines the estimator $\hat{\pi}_{zib}$ under two approaches, referred to in this research as the incorporation of the $\rho_1$ and $\rho_2$ components. We emphasize that $\rho_1$ and $\rho_2$ are understood as a systematic component of the estimation process, since the researcher may choose which function is assumed. Note that for u = 1, $\rho_2 = \rho_1$, which suggests that $\rho_2$ is a generalization of $\rho_1$, differing only in its asymptotic properties.
In this context, the values of the affinity constants $c_1$ and $c_2$ are defined according to the component: for $\rho_1$, the coefficients $u = c_2 = 1$ are fixed and a value of $c_1 < c_2 = 1$ is investigated. Thus, $\rho_1(x)$ is prone to a greater increase as $x \to \infty$.
According to Ruckstuhl and Welsh (2001), under the inequality $c_1 < c_2 = 1$ the maximum likelihood estimates tend to be more robust. For $\rho_2$, it is assumed that $c_1 = 0.1$, which keeps the restriction $c_1 < c_2 = 1$, whereas the value of u is examined so as to reduce the increase of $\rho_2(x)$ as $x \to \infty$.
It is worth underscoring that the accuracy and precision of estimator (3) depend on the values of the affinity constants $c_1$ and $c_2$, which make it robust to the expected number of null values. Consequently, the search for these constants was carried out by a computer routine.
The intention of Silva and Cirillo (2010) was to reproduce tables of values of u and $c_1$ for the two cases of $\rho_s(x)$ described in (6). After generating the binomial samples, the structure of the binomial linear functions was represented by the parametric value F, as shown in (8),
$$F = \sum_{i=1}^{q} \delta_i \pi_i \qquad (8)$$
where q is the total number of binomial populations and $\delta_i$ is the coefficient associated with the success proportion $\pi_i$ of the i-th binomial population, following the specifications shown in Table 2.
Table 2. Coefficients used for linear function specifications.
Family    q    Coefficient vector used in composition of F
F1        3
For each linear function F representing a binomial family, the interval estimates for F were computed using Wald's confidence intervals according to expression (1). Maximum likelihood estimates were replaced by $\hat{\pi}_{zib}$ estimates with the systematic $\rho_1$ and $\rho_2$ components.
Finally, for each assessment scenario (Table 1), the 100(1-α)% intervals adapted for robust zero-inflated proportions were compared by the exact coverage probability for a fixed value of F (8), defined by (9), where $I(y_1, \ldots, y_q)$ equals 1 if the interval contains F when $Y_1 = y_1, \ldots, Y_q = y_q$ and equals 0 if the interval does not contain F. An approximation is obtained from 2000 Monte Carlo simulations as the percentage of estimated intervals that include the parameter F, calculated by a program developed in R 3.00 software (R DEVELOPMENT CORE TEAM, 2011).
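The Monte Carlo approximation of the coverage probability can be sketched as follows. This is a simplified illustration, not the authors' R program: it assumes plain (not zero-inflated) binomial counts, the usual Wald form for the interval of a linear function, and hypothetical function names and coefficient vectors.

```python
import math
import random


def wald_linear_interval(y, n, delta, z=1.959963984540054):
    """Wald interval for F = sum(delta_i * pi_i) using p_i = y_i / n_i."""
    p_hat = [yi / ni for yi, ni in zip(y, n)]
    est = sum(d * p for d, p in zip(delta, p_hat))
    var = sum(d * d * p * (1 - p) / ni for d, p, ni in zip(delta, p_hat, n))
    half = z * math.sqrt(var)
    return est - half, est + half


def mc_coverage(pi, n, delta, replicates=2000, seed=0):
    """Share of simulated Wald intervals that contain F = sum(delta_i * pi_i)."""
    rng = random.Random(seed)
    target = sum(d * p for d, p in zip(delta, pi))
    hits = 0
    for _ in range(replicates):
        # One binomial count per population (plain binomial, no zero inflation).
        y = [sum(rng.random() < p for _ in range(ni)) for p, ni in zip(pi, n)]
        lo, hi = wald_linear_interval(y, n, delta)
        hits += lo <= target <= hi
    return hits / replicates
```

A call such as `mc_coverage([0.5, 0.5, 0.5], [20, 20, 20], [1, -0.5, -0.5])` returns the estimated coverage for one scenario, which can then be compared with the nominal 95% level. Replacing the binomial draw with a zero-inflated one, and the sample proportion with a robust estimate, would bring the sketch closer to the study design.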

Results and discussion
Taking into consideration the evaluation scenarios described in the methodology (Section 2.1), the number of Bernoulli trials was set to m = 100 in this first step, when obtaining the study samples for the recommended methods.
With this specification, the maximum likelihood estimates $\hat{\pi}_{mle}$ and the zero-inflated robust estimates $\hat{\pi}_{zib}$ were obtained from binomial samples generated via Monte Carlo with percentages of null observations close to 20 and 30%, according to the mixture probabilities γ = 0.2 and 0.3. Results are shown in Tables 3-6. In short, the results made it clear that maximum likelihood estimates were not accurate in zero-inflated contaminated binomial samples. This is confirmed by the bias results, including those for larger sample sizes. However, for the $\hat{\pi}_{zib}$ estimates, the relative biases were on average less than 0.01 for almost all sample sizes and values of γ, apart from small fluctuations due to Monte Carlo error, for π = 0.5 (Tables 3 and 4) and π = 0.7 (Tables 5 and 6).
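The inaccuracy of the maximum likelihood estimate under zero inflation can be illustrated with a small simulation. The sketch assumes that the estimator in (4) is the usual pooled binomial proportion, that is, total successes over total trials. Under that assumption its expectation is (1 - γ)π, so the relative bias is about -γ whatever the sample size. Function names and default arguments are hypothetical.

```python
import random


def rzib(n, m, pi, gamma, rng):
    """n zero-inflated binomial counts: zero w.p. gamma, Binomial(m, pi) otherwise."""
    return [
        0 if rng.random() < gamma else sum(rng.random() < pi for _ in range(m))
        for _ in range(n)
    ]


def pooled_mle(sample, m):
    """Binomial MLE that ignores zero inflation: total successes over total trials."""
    return sum(sample) / (len(sample) * m)


def relative_bias(pi, gamma, n, m=100, replicates=500, seed=0):
    """Monte Carlo relative bias, (mean estimate - pi) / pi, of the pooled MLE."""
    rng = random.Random(seed)
    estimates = [pooled_mle(rzib(n, m, pi, gamma, rng), m) for _ in range(replicates)]
    return (sum(estimates) / replicates - pi) / pi
```

For π = 0.5 and γ = 0.2, the relative bias stays close to -0.2 as n grows, in line with the observation that larger samples do not remove the bias.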
Based on the accuracy results for the $\hat{\pi}_{zib}$ estimates, the binomial linear functions for the Wald method were composed and the coverage probabilities were calculated. For comparison purposes, a 95% nominal confidence level was used. The binomial families are denoted by F1, F2, F3 and F4, corresponding respectively to coefficient vectors 1, 2, 3 and 4 described in Table 2. The plots of the estimated coverage probabilities are shown in Figures 1-8.
With a mean proportion of null values around 20% (γ = 0.20) of the sampled observations, the results shown in Figures 1-4 made it clear that an increase in sample size resulted in a decrease in coverage probability, with values much lower than the nominal confidence level.
This was observed for the binomial families composed from $\hat{\pi}_{zib}$ estimates with both the $\rho_1$ and $\rho_2$ components. However, when the proportion of null observations was increased to about 30% of the sample units (γ = 0.30), and the parametric value that maximizes the variance of the binomial proportions was considered, that is, π = 0.5, the binomial families whose zero-inflated proportions were estimated with the $\rho_1$ component showed coverage probabilities greater than the nominal confidence level (Figure 5). The same result was observed for all sample sizes when the parametric value increased, in situations where the estimates were obtained using the $\rho_1$ and $\rho_2$ systematic components (Figures 7-8). It is worth mentioning that the Wald method, in the context of obtaining estimates for binomial families, was assessed by Cirillo et al. (2009) using the infinite bootstrap algorithm recommended by Conlon and Thomas (1990). Within this approach, the authors also concluded, across different assessment scenarios, that the coverage probabilities were inconsistent with the nominal confidence level.
Silva and Cirillo (2010) studied the use of a robust estimator for inference on a binomial model contaminated by a mixture of binomial populations, with samples obtained through Monte Carlo simulation. Their study used an estimator belonging to the E-estimator class (RUCKSTUHL; WELSH, 2001) incorporating the $\rho_1(x)$ component (8), which modifies the E-estimator. Several values of the affinity constant $c_1$ were considered, in the range 0.1 ≤ $c_1$ ≤ 0.9, with sample sizes equal to 10, 50 and 80 and mixture rates equal to 0.20 and 0.40. The main conclusion was the recommendation to assume $c_1$ = 0.1 for samples larger than n = 50. The results of the present work confirm those already described with regard to the flaws noticed in the Wald method and to the choice of the constant $c_1$ based on sample size and degree of contamination.
The Wald method, when using zero-inflated proportion estimates obtained by the $\hat{\pi}_{zib}$ estimator with the systematic $\rho_2$ component, may be recommended in situations with proportions which maximize the binomial family variance, that is, π = 0.7, since for this parametric value the scenarios evaluated led to coverage probabilities greater than 95%.

Conclusion
The Wald method combined with estimates of zero-inflated binomial proportions obtained with the $\rho_2$ component showed results in line with the nominal confidence level. In practical terms, this method is recommended for samples in which the success proportions are close to 0.7 and the proportion of excess zeros is close to 0.3.

Table 3. Comparative results of mle

Table 4. Comparative results of mle

Table 5. Comparative results of mle

Table 6. Comparative results of the $\hat{\pi}_{mle}$ and $\hat{\pi}_{zib}$ estimators for the parametric value π = 0.7, with $c_1$ = 0.1 and the $c_2$ = 1 restriction characterizing the systematic $\rho_2$ component.