Zero-inflated beta regression model for leaf citrus canker incidence in orange genotypes grafted onto different rootstocks

Data with excess zeros are frequently found in practice, and the recommended analysis is to use models that adequately account for the zero observations. In this study, the Zero Inflated Beta Regression Model (BeZI) was applied to experimental data to describe the mean incidence of leaf citrus canker in orange groves under the influence of genotype and rootstock. Based on the model, it was possible to quantify the odds that a particular plant presents a null mean incidence according to its genotype and rootstock, and to estimate the expected mean incidence for each combination. The Laranja Caipira rootstock proved to be the most resistant to leaf citrus canker, whereas Limão Cravo proved to be the most susceptible. The Ipiguá IAC, Arapongas, EEL and Olímpia genotypes have statistically equivalent odds.


Introduction
The agronomic framework
Similar to other agricultural crops, orange cultivation demands careful management practices. Among the various diseases that currently affect orange groves is citrus canker, caused by Xanthomonas citri subsp. citri (Gonçalves-Zuliani et al., 2016). There is currently no effective control for this disease, and the recommendation is to eliminate infected plants; however, some studies have pointed to possible means of control, suggesting that the disease may become controllable in the future (Sauer et al., 2015). Current techniques to prevent citrus canker in orange groves suggest grafting onto disease-resistant rootstocks (Reis et al., 2008).
Choosing a suitable rootstock should take many factors into account, and the possibility of establishing a criterion of resistance to citrus canker for the canopy variety is an important feature in the integrated management or prevention of the disease, especially in regions where citrus canker is endemic (Danos, Berger, & Stall, 1984; Agostini, Graham, & Timmer, 1985). In the state of Paraná, there are few studies on the impact of citrus canker on specific genotypes under the influence of a rootstock. This study evaluated the resistance of Laranja Doce (Citrus sinensis (L.) Osbeck) var. Pera genotypes grafted onto different rootstocks to the action of the bacterium Xanthomonas citri subsp. citri. The modeling was performed using the Zero Inflated Beta Regression model.

The statistical framework
The Beta distribution is commonly used to describe the variability of a random variable $Y$ supported on an open interval $I = (a, b) \subset \mathbb{R}$, with $a, b \in \mathbb{R}$ and $a < b$. If, in particular, $Y$ denotes a proportion, then $a = 0$ and $b = 1$, i.e. $I = (0,1) \subset \mathbb{R}$. A statistical approach that examines the existence of, and quantifies, any possible associations between $Y$ and a set of factors that influence its behavior is the Beta Regression Model.
With extensive application, Beta Regression Models have been used to understand specific aspects of a multitude of phenomena, such as scores in tests conducted on dyslexic and non-dyslexic individuals (Smithson & Verkuilen, 2006), the canopy cover rate of trees in forests according to the basal area, tree height, local fertility, and other factors (Korhonen, Korhonen, Stenberg, Maltamo, & Rautiainen, 2007), the selective collection rate in some cities, considering socioeconomic, demographic and other factors (Ibáñez, Prades, & Simó, 2011), the relationship between trade and business cycles (Mendonça, Silvestre, & Passos, 2011), and many other areas such as health (Moraes, Rocha, & Machado, 2012) or meteorology and climatology (Mullen, Marshall, & McGlynn, 2013).
A severe restriction on the use of the Beta distribution is its support, an open interval. This difficulty can be overcome with the use of a Two-Part Model. Although the model known as the Zero Inflated Beta Model is relatively recent, the use of Inflated Models is well addressed in the literature. This research presents the Zero Inflated Beta Model and uses this theory to model a real agronomic data set. To obtain estimates of leaf citrus canker incidence, the number of infected leaves was counted and divided by the total number of leaves. This process was repeated in five evaluations and, at the end of the experiment, the average of the observed incidences was calculated, resulting in 360 sampled mean incidences of leaf citrus canker.
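The computation of the response variable described above can be sketched in a few lines of Python. This is only an illustration: the function names and the leaf counts are hypothetical and are not taken from the experiment.

```python
from statistics import mean


def leaf_incidence(infected_leaves, total_leaves):
    """Proportion of infected leaves observed in a single evaluation."""
    if total_leaves <= 0:
        raise ValueError("total_leaves must be positive")
    if not 0 <= infected_leaves <= total_leaves:
        raise ValueError("infected_leaves must lie between 0 and total_leaves")
    return infected_leaves / total_leaves


def mean_incidence(evaluations):
    """Average incidence over the (infected, total) pairs of one plant."""
    return mean(leaf_incidence(i, t) for i, t in evaluations)


# Hypothetical counts (infected, total) for one plant over five evaluations.
plant = [(0, 50), (1, 50), (2, 40), (0, 50), (1, 50)]
print(mean_incidence(plant))

# A plant with no infected leaves in any evaluation yields an exact zero,
# which is the kind of observation a plain Beta model cannot accommodate.
healthy = [(0, 50)] * 5
print(mean_incidence(healthy))
```

A mean incidence is exactly zero only when no infected leaf is found in any of the five evaluations, which is why zeros carry information of their own about resistance.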

Material and methods
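A random variable $Y$ has a Beta distribution with parameters $p$ and $q$, denoted $Y \sim \mathrm{Be}(p, q)$, if its probability density function is given by Equation 1:
$$f(y; p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\, y^{p-1} (1-y)^{q-1}, \quad 0 < y < 1, \tag{1}$$
where: $p > 0$, $q > 0$ are shape parameters and the symbol $\Gamma$ denotes the Gamma function.
The Beta distribution is a very versatile family of probability distributions and may take a wide variety of forms, such as constant (uniform distribution, when $p = q = 1$), symmetrical unimodal or bathtub (when $p = q \neq 1$), negatively asymmetric unimodal or strictly increasing (when $p > q$), and positively asymmetric unimodal or strictly decreasing (when $p < q$).
In regression analysis it is common, and reasonably convenient, to model the expected value of the response variable; that is, whenever possible, it is more interesting to fit the model to a parameter that represents the mean of the random variable. In the case of $\mathrm{Be}(p, q)$, note that the expected value of $Y$, $E(Y) = p/(p+q)$, is a function of both parameters $p$ and $q$.
The idea is to work with a new parameterization of the Beta distribution. For this, consider $\mu = p/(p+q)$, so that $0 < \mu < 1$, and $\phi = p + q$, so that $\phi > 0$; the reverse substitutions are uniquely determined by $p = \mu\phi$, so that $p > 0$, and $q = (1-\mu)\phi$, so that $q > 0$. Under the new parameterization, the $\mathrm{Be}(p, q)$ distribution will be denoted by $\mathrm{Be}(\mu, \phi)$ and its probability density function is given by Equation 3:
$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1} (1-y)^{(1-\mu)\phi-1}, \quad 0 < y < 1, \tag{3}$$
where: $\mu \in (0,1)$ is a position parameter, $\phi > 0$ is a scale parameter and $\Gamma$ denotes the Gamma function.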
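The reparameterization can be checked numerically with a short Python sketch. Here `mu` stands for the mean parameter and `phi` for the sum of the two shape parameters; the function names are illustrative choices and not part of the original study.

```python
import math


def shape_to_mean_precision(p, q):
    """Map the shape parameters (p, q) to (mu, phi) = (p / (p + q), p + q)."""
    return p / (p + q), p + q


def mean_precision_to_shape(mu, phi):
    """Inverse map: (mu, phi) -> (p, q) = (mu * phi, (1 - mu) * phi)."""
    return mu * phi, (1.0 - mu) * phi


def beta_pdf_mean_precision(y, mu, phi):
    """Beta density at y under the (mu, phi) parameterization."""
    if not 0.0 < mu < 1.0 or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    if not 0.0 < y < 1.0:
        return 0.0  # the support is the open unit interval
    p, q = mean_precision_to_shape(mu, phi)
    # Work on the log scale so that large phi does not overflow the Gamma function.
    log_pdf = (math.lgamma(phi) - math.lgamma(p) - math.lgamma(q)
               + (p - 1.0) * math.log(y) + (q - 1.0) * math.log1p(-y))
    return math.exp(log_pdf)


# mu = 0.5 and phi = 2 correspond to p = q = 1, the uniform density on (0, 1).
print(beta_pdf_mean_precision(0.3, 0.5, 2.0))
```

Note that the density returns zero at `y = 0`, which illustrates why exact zeros cannot be handled by the Beta distribution alone.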
Since the support of a random variable with Beta distribution is the open unit interval $(0,1)$, this distribution is limited in practical applications that contain zero observations. One way to overcome this difficulty is to use Mixture Models theory (specifically, a Two-Part Model) and compose a new distribution, here denoted by $\mathrm{BeZI}(\alpha, \mu, \phi)$, based on two distinct distributions: one to model the zero observations and another to model the observations belonging to the interval $(0,1)$.
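The two-part construction can be illustrated by simulation. The Python sketch below assumes that `alpha` denotes the probability of an exact zero and that the positive part follows a Beta distribution with mean `mu` and parameter `phi` (the sum of the two shape parameters); the numerical values are arbitrary and are not estimates from the experiment.

```python
import random


def sample_bezi(alpha, mu, phi, rng):
    """Draw one value from a zero-inflated Beta distribution.

    With probability alpha the draw is an exact zero; otherwise it comes
    from a Beta distribution with shape parameters mu * phi and (1 - mu) * phi.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not 0.0 < mu < 1.0 or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    if rng.random() < alpha:
        return 0.0
    return rng.betavariate(mu * phi, (1.0 - mu) * phi)


# Arbitrary illustrative values: 40% zeros, positive part centred at 5%.
rng = random.Random(2024)
draws = [sample_bezi(0.4, 0.05, 50.0, rng) for _ in range(20000)]
zero_share = sum(d == 0.0 for d in draws) / len(draws)
sample_mean = sum(draws) / len(draws)
print(zero_share, sample_mean)
```

In a large sample the share of zeros should be close to `alpha` and the overall mean close to `(1 - alpha) * mu`, which is the reason both components matter when interpreting the mean incidence.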
The $\mathrm{BeZI}(\alpha, \mu, \phi)$ probability density function is expressed by Equation 4:
$$\mathrm{bezi}(y; \alpha, \mu, \phi) = \begin{cases} \alpha, & \text{if } y = 0, \\ (1-\alpha)\, f(y; \mu, \phi), & \text{if } y \in (0,1), \end{cases} \tag{4}$$
where: $\alpha \in (0,1)$ is the probability of observing a zero and $f(y; \mu, \phi)$ is the $\mathrm{Be}(\mu, \phi)$ probability density function.
The idea of modeling the expected value of the $\mathrm{Be}(p, q)$ distribution had already been under discussion for some time, e.g. in the works of Jorgensen (1997), Paolino (2001) and Kieschnick and McCullough (2003). However, the regression model presented by Ferrari and Cribari-Neto (2004) became popular for formulating more carefully the modeling of the expected value of the $\mathrm{Be}(p, q)$ distribution, based on the $\mathrm{Be}(\mu, \phi)$ parameterization, and for establishing an association with GLM theory, a class well described in the literature by Nelder and Wedderburn (1972). These and other works gave rise to a number of new studies and approaches, such as Cepeda-Cuervo (2015).
For more appropriate modeling in the context of small samples, likelihood inference is discussed by Ferrari and Pinheiro (2011), whose proposed adjusted statistics were more reliable in simulation. A discussion about inference in a Beta Regression Model with non-constant precision, also for small samples, can be found in Cribari-Neto and Queiroz (2012).
Diagnostic analysis in Beta Regression Models was revisited by Rocha and Simas (2011), Ferrari, Espinheira, and Cribari-Neto (2011), Chien (2011) and Anholeto, Sandoval, and Botter (2014), with some new influence measures, graphical tools and a new residual type for diagnosis. A variable selection methodology for modeling with non-constant dispersion was proposed by Zhao, Zhang, Lv, and Liu (2014). Furthermore, Latif and Yab (2015) present a way to determine the optimum design for an experiment with a $\mathrm{Be}(\mu, \phi)$ regression model involving a single predictor variable.
The $\mathrm{BeZI}(\alpha, \mu, \phi)$ regression model received a formal treatment in Ospina and Ferrari (2012). Because it is a more recent model, the complementary theory that naturally follows is not yet as widely available, even though the whole development achieved for the $\mathrm{Be}(\mu, \phi)$ regression model is closely associated with it. The literature offers a new Likelihood Ratio Test for Beta Inflated Models (Pereira & Cribari-Neto, 2014b) and a discussion of misspecification detection in Beta Inflated modeling (Pereira & Cribari-Neto, 2014a).

Model definition
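Consider a set of independent random observations $\mathbf{y} = (y_1, \ldots, y_n)^\top$, where each component $y_i$, with $i = 1, \ldots, n$, is such that $y_i \sim \mathrm{BeZI}(\alpha_i, \mu_i, \phi_i)$. Following the formulation of a Generalized Additive Model for Location, Scale and Shape (GAMLSS) presented in Rigby and Stasinopoulos (2005), each parameter $\alpha_i$, $\mu_i$ and $\phi_i$ of the $\mathrm{BeZI}(\alpha_i, \mu_i, \phi_i)$ distribution can be associated with explanatory variables through a link function. The Zero Inflated Beta Regression Model can be defined by Equation 6:
$$y_i \sim \mathrm{BeZI}(\alpha_i, \mu_i, \phi_i), \tag{6}$$
where: $i = 1, \ldots, n$. Furthermore, the parameters $\alpha_i$, $\mu_i$ and $\phi_i$ satisfy the functional relationships given by Equation 7:
$$g_\alpha(\alpha_i) = \mathbf{x}_{\alpha i}^\top \boldsymbol{\beta}_\alpha, \qquad g_\mu(\mu_i) = \mathbf{x}_{\mu i}^\top \boldsymbol{\beta}_\mu, \qquad g_\phi(\phi_i) = \mathbf{x}_{\phi i}^\top \boldsymbol{\beta}_\phi, \tag{7}$$
where: $\boldsymbol{\beta}_\alpha = (\beta_{\alpha 1}, \ldots, \beta_{\alpha k_\alpha})^\top$, $\boldsymbol{\beta}_\mu = (\beta_{\mu 1}, \ldots, \beta_{\mu k_\mu})^\top$ and $\boldsymbol{\beta}_\phi = (\beta_{\phi 1}, \ldots, \beta_{\phi k_\phi})^\top$ are parameter vectors of order $k_\alpha$, $k_\mu$ and $k_\phi$, respectively, associated with the three distribution parameters, and $\mathbf{x}_{\alpha i} = (x_{\alpha i 1}, \ldots, x_{\alpha i k_\alpha})^\top$, $\mathbf{x}_{\mu i} = (x_{\mu i 1}, \ldots, x_{\mu i k_\mu})^\top$ and $\mathbf{x}_{\phi i} = (x_{\phi i 1}, \ldots, x_{\phi i k_\phi})^\top$ are vectors of explanatory variable observations associated with $\alpha_i$, $\mu_i$ and $\phi_i$ for the $i$-th individual.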
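As a minimal sketch of how linear predictors are mapped back to the three distribution parameters, the Python code below assumes a logit link for the zero probability and for the mean, and a log link for the remaining parameter. These link choices, the function names and the coefficient values are assumptions made for illustration; the links actually used in the study are not stated in this section.

```python
import math


def inv_logit(eta):
    """Inverse logit link, written to avoid overflow for large |eta|."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def linear_predictor(x, coefs):
    """Inner product between a covariate vector and a coefficient vector."""
    if len(x) != len(coefs):
        raise ValueError("x and coefs must have the same length")
    return sum(xj * bj for xj, bj in zip(x, coefs))


def bezi_parameters(x_alpha, b_alpha, x_mu, b_mu, x_phi, b_phi):
    """Return (alpha, mu, phi) for one individual under the assumed links."""
    alpha = inv_logit(linear_predictor(x_alpha, b_alpha))  # assumed logit link
    mu = inv_logit(linear_predictor(x_mu, b_mu))           # assumed logit link
    phi = math.exp(linear_predictor(x_phi, b_phi))         # assumed log link
    return alpha, mu, phi


# Hypothetical design: intercept plus one dummy variable (e.g. one rootstock).
x = [1.0, 1.0]
print(bezi_parameters(x, [1.0, -2.0], x, [-3.5, 0.8], [1.0], [3.0]))
```

With dummy-coded categorical covariates, the intercept corresponds to the reference combination and each remaining coefficient shifts the linear predictor for one category.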

Expected value and variance
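Considering the link functions $g_\alpha$, $g_\mu$ and $g_\phi$ for the distribution parameters $\alpha_i$, $\mu_i$ and $\phi_i$, respectively, the expected value and variance of $y_i$ are expressed by Equations 8 and 9:
$$E(y_i) = (1 - \alpha_i)\, \mu_i, \tag{8}$$
$$\mathrm{Var}(y_i) = (1 - \alpha_i)\, \frac{\mu_i (1 - \mu_i)}{\phi_i + 1} + \alpha_i (1 - \alpha_i)\, \mu_i^2, \tag{9}$$
where each parameter is obtained by applying the inverse of its link function to the corresponding linear predictor. Interval estimates for $E(y_i)$ and $\mathrm{Var}(y_i)$ require the respective standard errors, which can be obtained with the Delta Method.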
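The moments of the zero-inflated Beta distribution and a first-order Delta Method approximation for the standard error of the expected value can be sketched as follows. In this Python illustration, `alpha` is the zero probability, `mu` the mean of the positive part and `phi` the sum of the two Beta shape parameters. The Delta Method helper assumes that the variances and covariance of the estimates are already expressed on the scale of `alpha` and `mu` (not on the link scale), which is a simplifying assumption of this sketch.

```python
import math


def bezi_mean(alpha, mu):
    """Expected value of a zero-inflated Beta variable: (1 - alpha) * mu."""
    return (1.0 - alpha) * mu


def bezi_variance(alpha, mu, phi):
    """Variance: Beta variance weighted by (1 - alpha) plus the mixing term."""
    beta_var = mu * (1.0 - mu) / (phi + 1.0)
    return (1.0 - alpha) * beta_var + alpha * (1.0 - alpha) * mu ** 2


def delta_se_mean(alpha, mu, var_alpha, var_mu, cov_alpha_mu=0.0):
    """First-order Delta Method standard error of (1 - alpha) * mu."""
    d_alpha = -mu          # partial derivative with respect to alpha
    d_mu = 1.0 - alpha     # partial derivative with respect to mu
    variance = (d_alpha ** 2 * var_alpha + d_mu ** 2 * var_mu
                + 2.0 * d_alpha * d_mu * cov_alpha_mu)
    return math.sqrt(variance)


# Arbitrary illustrative values, not estimates from the fitted model.
print(bezi_mean(0.5, 0.1), bezi_variance(0.5, 0.1, 20.0))
print(delta_se_mean(0.5, 0.1, var_alpha=0.0025, var_mu=0.0001))
```

When `alpha` is zero the two moment functions reduce to the usual Beta mean and variance, which gives a quick sanity check of the formulas.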

Results and discussion
A $\mathrm{BeZI}(\alpha, \mu, \phi)$ regression model was fitted to understand the behavior of the mean incidence of leaf citrus canker (denoted by $Y$), influenced by two categorical covariates: the rootstock (denoted by PE, with categories Laranja Caipira, Limão Cravo, Tangerina Cleópatra and Tangerina Sunki) and the genotype (denoted by GE, with categories Arapongas, Bianchi, EEL, IAC, IAC 2000, Ipiguá IAC, N58, N59 and Olímpia).
Figure 1 illustrates the histogram, box plot and empirical cumulative distribution function of the mean incidence for each rootstock. The highest proportion of zeros corresponds to the rootstock Laranja Caipira (74.44%), while the lowest corresponds to the rootstock Limão Cravo (10.00%). Considering the dispersion of observations, visually, the least dispersed set corresponds to rootstock Laranja Caipira, followed by rootstock Tangerina Cleópatra, the two intermediaries in zero proportion observations. Table 1 presents the summary measures for each rootstock. Note that the lowest mean corresponds to the Laranja Caipira rootstock (0.0095), which also has the lowest standard deviation (0.0230); on the other hand, the highest mean corresponds to the Limão Cravo rootstock (0.0544), which also has the highest standard deviation (0.0428).
For the genotypes, the summary measures of mean incidence are shown in Table 2. The lowest means correspond to the Ipiguá IAC (0.0144) and N59 (0.0187) genotypes, whose standard deviations are among the three lowest, while the highest means are observed for EEL (0.0354) and Olímpia (0.0331). Figure 2 shows the histogram, box plot and empirical cumulative distribution function of the mean incidence for each genotype.
The highest proportions of zero observations correspond to the Arapongas and Ipiguá IAC genotypes (60.00 and 57.50%), while the lowest correspond to the IAC and IAC 2000 genotypes (22.50% each); in general, however, the genotypes present proportions similar to each other, i.e. the minimum and maximum are not so different from the others. Also note that, based on the box plots, the visually least dispersed sets of observations correspond to the IAC and N59 genotypes, which, although they have low proportions of null observations, concentrate their observations very close to zero.

Adjustment and considerations
Taking the combination Laranja Caipira + Ipiguá IAC as reference, the model adopted after selection among the candidate models is given by Equation 10.
The estimated coefficients for the $\alpha$ parameter of the $\mathrm{BeZI}(\alpha, \mu, \phi)$ distribution can be observed in Table 3. There is evidence that the influence of the Arapongas, EEL and Olímpia genotypes on the proportion of zeros for the mean incidence of leaf citrus canker does not differ statistically from that observed for the reference combination (Laranja Caipira + Ipiguá IAC). On the other hand, the estimates indicate that the influence exerted by the other three rootstocks (Limão Cravo, Tangerina Cleópatra and Tangerina Sunki) is statistically significant compared to the reference combination.
For the $\mu$ and $\phi$ coefficients, the results are in Table 4. There are indications that the influence of the Arapongas, EEL and Olímpia genotypes on the mean incidence of leaf citrus canker, in this case, is statistically different from that of the reference combination. Furthermore, only the Limão Cravo rootstock differs statistically from the reference combination. The evaluation of model suitability, if positive, allows conclusions to be drawn from the fit. This assessment should be performed with diagnostic criteria, such as graphical analysis of the residuals from the optimization process.
The standardized residuals for the discrete and continuous components are shown in Figure 3a and b, respectively. For the discrete component, observations 11, 15, 21, 22, 66, 272, 273 and 274 lie further away from the other observations. In the continuous component, the distance between observations is not as pronounced.
Removing these observations one by one, and all together, did not result in significant differences in the estimates or in different interpretations from those of the model obtained with all observations; therefore, we decided to keep all observations. Figure 4 shows the simulated envelopes used to assess the adequacy of the standardized (a), randomized (b) and weighted (c) residuals. In all plots, very few observations fall outside the confidence bands, which strengthens the assumption that the Zero Inflated Beta Regression model accommodates the inherent variability of the response variable.
Finally, Figure 5 presents the worm plots built separately for each component of the model, discrete (a) and continuous (b). None of the observations lies beyond the confidence bands, indicating a good fit.
Based on the graphical analysis, it is assumed that the model is able to adequately represent the reality observed in the experiment. A pseudo-$R^2$ equal to 0.49989, with confidence interval (0.41813; 0.57361), indicates that the fit accounts for approximately 49.99% of the variability of the studied incidence.
Since the graphical analysis indicates a good fit for the proposed model, it is possible to interpret the results and draw conclusions based on the modeling. The estimates obtained for the $\alpha$ coefficients, shown in Table 3, make it possible to compute the odds of observing a null mean incidence of leaf citrus canker for each genotype or rootstock relative to the reference combination. The point and interval estimates can be found in Table 5.
Setting to one the chance that an observation from the reference combination (and from any genotype and/or rootstock classified as statistically equivalent to it) equals zero, Figure 6 illustrates a color matrix of the estimated chance that a particular plant presents a null mean incidence of leaf citrus canker, compared to the reference genotype + rootstock combination. Clearly, the Laranja Caipira rootstock presents a resistance to leaf citrus canker superior to the others. The chances of observing a null mean incidence with the Limão Cravo, Tangerina Cleópatra and Tangerina Sunki rootstocks are, respectively, 0.0265, 0.1700 and 0.1443 times the chance for the Laranja Caipira rootstock. Based on these results, the Limão Cravo rootstock has the highest susceptibility to leaf citrus canker. Regarding the genotypes and their resistance to leaf citrus canker, four genotypes are statistically equivalent: Ipiguá IAC (reference), Arapongas, EEL and Olímpia. The chances that a null mean incidence is found in individuals of the N58 and N59 genotypes are indistinguishable from each other (and equal to 0.3336 times the chance for the Ipiguá IAC genotype), as are those of the IAC and IAC 2000 genotypes (equal to 0.1282 times the chance for the Ipiguá IAC genotype). Moreover, the chance that the Bianchi genotype presents a null mean incidence is equal to 0.1811 times the chance for the Ipiguá IAC genotype.
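The relative chances reported above can be combined for a given rootstock and genotype, as in the following Python sketch. It assumes that the rootstock and genotype effects are additive on the logit scale with no interaction, so that the relative chances multiply; this is an assumption of the illustration, not a statement about the fitted model. The relative chances are the values quoted in the text, whereas `reference_probability` is a hypothetical input, since the zero probability of the reference combination is not reported in this section.

```python
# Relative chances of a null mean incidence, as quoted in the text
# (reference combination: Laranja Caipira + Ipiguá IAC).
ROOTSTOCK_RATIO = {
    "Laranja Caipira": 1.0,
    "Limão Cravo": 0.0265,
    "Tangerina Cleópatra": 0.1700,
    "Tangerina Sunki": 0.1443,
}
GENOTYPE_RATIO = {
    "Ipiguá IAC": 1.0, "Arapongas": 1.0, "EEL": 1.0, "Olímpia": 1.0,
    "N58": 0.3336, "N59": 0.3336,
    "IAC": 0.1282, "IAC 2000": 0.1282,
    "Bianchi": 0.1811,
}


def combined_ratio(rootstock, genotype):
    """Relative chance for a combination, assuming no interaction term."""
    return ROOTSTOCK_RATIO[rootstock] * GENOTYPE_RATIO[genotype]


def zero_probability(rootstock, genotype, reference_probability):
    """Convert a relative chance into a probability of a null incidence."""
    if not 0.0 < reference_probability < 1.0:
        raise ValueError("reference_probability must lie in (0, 1)")
    reference_odds = reference_probability / (1.0 - reference_probability)
    odds = reference_odds * combined_ratio(rootstock, genotype)
    return odds / (1.0 + odds)


print(combined_ratio("Limão Cravo", "Bianchi"))
# 0.8 is a hypothetical zero probability for the reference combination.
print(zero_probability("Limão Cravo", "Bianchi", 0.8))
```

The conversion to a probability depends strongly on the reference value supplied, so the second printed number should be read only as an illustration of the calculation.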
For the case in which the observed mean incidence is different from zero, that is, considering that the plant is infected, the results are listed in Table 4, and based on them it can be concluded that:
• Once the plant is infected, the Bianchi, IAC, IAC 2000, N58 and N59 genotypes are indistinguishable from the reference combination, as are the Tangerina Cleópatra and Tangerina Sunki rootstocks;
• The Arapongas, EEL and Olímpia genotypes are statistically different from the reference combination; note, however, that the contribution of each to the estimated mean incidence of leaf citrus canker is positive in every case, indicating that the estimated mean incidence is higher in these genotypes;
• Only the Limão Cravo rootstock is statistically different from the reference combination, and it also contributes to an increase in the mean incidence of leaf citrus canker.
In Figure 7, the color matrix shows the mean incidence of leaf citrus canker estimated by the model for each rootstock + genotype combination. The combination Limão Cravo + Arapongas is the most susceptible to the disease, with an expected mean incidence equal to 6.04%, followed by the combinations Limão Cravo + EEL (5.54%) and Limão Cravo + Olímpia (5.40%). On the other hand, the reference combination had the highest resistance, with an expected mean incidence equal to 0.29%, followed by Laranja Caipira + Olímpia (0.50%) and Laranja Caipira + N58 or N59 (both 0.73%).

Conclusion
The $\mathrm{BeZI}(\alpha, \mu, \phi)$ regression model was adequate to represent the mean incidence of leaf citrus canker. The approach based on a regression model makes it possible to isolate and quantify the effects that influence the response variable.
The discrete component made it possible to understand the odds of observing a zero mean incidence and showed quantitatively that the Laranja Caipira rootstock is the most resistant to leaf citrus canker, whereas Limão Cravo is the most susceptible to the same disease. In addition, the Ipiguá IAC, Arapongas, EEL and Olímpia genotypes have statistically equivalent odds of observing a zero mean incidence.
The combination of the discrete and continuous components made it possible to quantify the expected mean incidence and demonstrated that the Limão Cravo rootstock and the Arapongas, EEL and Olímpia genotypes tend to increase the expected mean incidence. The lowest expected mean incidence was obtained for the reference combination.