Modeling citrus huanglongbing data using a zero-inflated negative binomial distribution

Zero-inflated data from field experiments can be problematic, as these data require the use of specific statistical models during the analysis process. This study utilized the zero-inflated negative binomial (ZINB) model with the logand logistic-link functions to describe the incidence of plants with Huanglongbing (HLB, caused by Candidatus liberibacter spp.) in commercial citrus orchards in the Northwestern Parana State, Brazil. Each orchard was evaluated at different times. The ZINB model with random effects in both link functions provided the best fit, as the inclusion of these effects accounted for variations between orchards and the numbers of diseased plants. The results of this model show that older plants exhibit a lower probability of acquiring HLB. The application of insecticides on a calendar basis or during new foliage flushes resulted in a three times larger probability of developing HLB compared with applying insecticides only when the vector was detected.


Introduction
Orange (Citrus sinensis) production encompasses the largest acreage of any Citrus species in Brazil.Citrus sinensis was introduced in Brazil during the colonial period, but did not become the largest orange production acreage in the world until the mid-1980s (Couto & Canniatti-Brazaca, 2010).Frozen orange juice, which is largely exported, is one of the most important agricultural commodities in Brazil.International trade could further improve if the current phyto-sanitary issues did not inhibit orchard productivity in Brazil (Paulillo, 2006).The crop is affected by numerous diseases, but only a few are of agricultural importance.The most destructive citrus disease in Brazil is Huanglongbing (HLB, caused by the bacteria Candidatus liberibacter spp.).The disease poses threat to citrus crops around the world.The first cases were detected in southern China in 1919.The disease has since spread to 40 countries in Asia, Africa, Oceania, South America and North America (Bové, 2006).In 2004, HLB was first detected in the Americas, in Araraquara County, Central São Paulo State, Brazil (Teixeira et al., 2005).
HLB does not immediately kill the tree, but decreases the tree health and productivity over a period of years.These health issues eventually compromise the economic viability of entire orchards over periods of seven to ten years, particularly when disease control decisions are not expediently made (Gottwald, da Graça, & Bassanezi, 2007).In healthy orchards, the bacterium is transmitted by the vector Diaphorina citri Kuwayama (Hemiptera: Psyllidae).In some cases, orchards can contain large D. citri populations, but be completely HLB free.Thus, a data set encompassing the variable of incidence of diseased plants may contain many zeros.
Data sets with an excess number of zeros are commonly analyzed using zero-inflated models.Famoye and Singh (2006) fit a zero-inflated Poisson model (ZIP) to investigate domestic violence.Xie, Wei, and Lin (2008) investigated pharmaceutical data using a ZIP model with a random effect.Hall (2000) used the same model to investigate repeated measures in agriculture experiments.However, data are over dispersed, with or without random effects, these models are not appropriate.Thus, the use of a zero-inflated negative binomial (ZINB) model is more appropriate for analyzing these types of data sets.Garay, Hashimoto, Ortega, and Lachos (2011) effectively used a ZINB model for over dispersed data.Yau, Wang, and Lee (2003) used a ZINB model with random effects to analyze the hospitalized times of pancreatic patients in different healthcare facilities.
The HLB incidence data collected from multiple orchards over multiple time periods requires a similar approach.Thus, we investigated the applicability of the ZINB mixed regression model to analyze the HLB incidence in Brazilian orchards.The excessive zero count is included in the ZINB model, along with over dispersion in the negative binomial distribution.Furthermore, the response variable dependence on repeated prevention measures over time is accommodated by a random effect that is incorporated into the linear predictive model.

HLB incidence
Sweet orange orchards were analyzed in this study.The orchards are located in the Northwestern Parana State, an area where HLB has been monitored due to the prevalence of D. citri.The scale of citrus production in this area ranges from small to large orchards.However, various size orchards are managed using similar methods, including application of pest and disease control measures.The HLB incidence data were collected on four occasions at intervals of 90 days in 2011.
The orchards were planted with the "Pera", "Valência" and "Folha Murcha" sweet orange (Citrus sinensis) varieties.Three methods were used to make management decisions related to controlling the vector.First, insecticide was applied only after detecting the presence of the vector.Second, insecticide was applied based on the timing of new foliage flushes.Finally, a calendar-based application was used based on three different tree agesorchards with trees from 0 to five, six to ten and greater than eleven years old.These data were collected by projects supported by the Núcleo de Biotecnologia Aplicada from the Centro de Ciências Agrárias in the Universidade Estadual de Maringá, Maringá County, Parana State, Brazil.

Zero-inflated negative binomial model
Suppose that variables following the zero-inflated negative binomial distribution given by: where p is the probability of observing at least one diseased plant, λ is the mean number of diseased plants and k is the dispersion parameter.When k → ∞ , 1/ k approaches 0 as the negative binomial approaches the Poisson distribution.Thus, the ZIP and ZINB are closely related distributions, and the ZINB distribution can be seen as a flexible extension of the ZIP (Minami, Lennert-Cody, Gao, & Román-Verdesoto, 2007).The expected estimate and variance of the ZINB distribution are, respectively given by: ) is the response variable, the number of diseased plants in the i-th orchard during the j-th evaluation, which considers that ij Y follows the ZINB distribution with the inclusion of the ZIP model covariables, is based on Lambert (1992).We investigated the logarithmiclink function (log-link) of the parameter λ , which was used to linearize the mean from the negative binomial.In addition, the logistic link function (logit-link) of the parameter p , which represents the proportion of zeros, was also analyzed.Furthermore, the observations can be treated independently among the orchards, but correlations may exist among records from the same orchards that were recorded at different times.These instances were explicitly modeled by incorporating a random effect into the linear predictors.Thus, the functions that connect the model parameters and covariates (the link functions) are described by: logit( ) where X and G are the covariate matrices (m × n) considered in the models, which included management 2 (1 = insecticides applied to trees when producing new foliage flushes, 0 = otherwise), management 3 (1 = calendar application, 0 = otherwise), Pera (1 = Pera, 0 = otherwise), Valência (1 = Valência, 0 = otherwise), age 2 (1 = from 6 to 10 years old, 0 = otherwise), age 3 (1 = greater than 11 years old, 0 = otherwise), Evaluation 2 (1 = May, 0 = otherwise), Evaluation 3 (1 = August, 0 = otherwise) and Evaluation 4 (1 = December, 0 = otherwise).The management 1 (insecticide application only when a vector is present), Folha Murcha, age 1 (age from 0 to 5 years old) and Evaluation 1 (January) scenarios were adopted as references.β and γ represent the regression coefficients vectors, while represent the unknown parameters with random effect vectors, respectively.We assumed that u and v were independent and distributed as the inflated component).Therefore, two data modeling approaches were developed, including the λ vector with a non-related p , or p as a function of λ .The same relationships were considered for the ZINB model.In addition, the odds ratio (OR) and relative risk (RR) of the link functions (logit and log with random effects) can be calculated.Jiang (2007) applied BLUP (the best linear unbiased prediction, McGilchrist, 1994) procedures from the generalized linear mixed model (GLMM) method to maximize the sum of the components of the log-likelihood function l is the log- likelihood for u and v .Thus, 1 l and 2 l are given by: . The maximization of the loglikelihood function was achieved using the EM algorithm due to its stabilization ability (Dempster, Laird, & Rubin, 1977;McLachlan & Krishnan, 2007).Thus, we assumed a latent variable l of Y and Z is: = + , the log- likelihood function for the complete data set with ij ξ and ij η separated to facilitate the parameter estimates is given by: Acta Scientiarum.Agronomy Maringá, v. 38, n. 3, p. 299-306, July-Sept., 2016


The EM algorithm replaces ij z based on the conditional expected value and Lee (2003) reported that the expected value of ij z is given by: ˆl u + ) and ( ( 1)  ˆl = + is achieved using two sets of Newton-Raphson algorithms, as given by: and the corresponding elements of the information matrix upon convergence.The residual maximum likelihood (REML) corrects the bias from the maximum likelihood by estimating the variance components (Yau & Lee, 2001;McGilchrist, 1994;Wood, 2011).
The method used in this article was developed using the R version 3.0.2.statistical software package, in which regression models with and without random effects were adjusted based on the program created by Andy Lee and Kelvin Yau in SPlus and adapted for R by Dave Atkins.

Results and discussion
The incidence of HLB diseased trees in orchards was frequently low, resulting in data sets with large numbers of zeros, justifying the use of zero-inflated models (Figure 1).When the ZINB model was fit to these data, the variables had no influence on the proportion of zeros (healthy trees) (Table 1).
The estimate of 1/ k (1.977) indicates overdispersion in these data, providing compelling evidence to fit the ZINB model.This ZINB model (Table 1) assumes the independence of the response variable (incidence of HLB-diseased trees).However, the origin of the response variable is the incidence of HLB-diseased plants in every orchard, but at different times (repeated measures).Thus, the assumed correlation between the number of diseased plants and the lack of data independence is typical in this sort of analysis.Therefore, the ZINB model with random effects represents the preferable model for analyzing these repeat measurement data sets.
The ZINB with random effects estimates for both link functions (logistic and logarithm) are shown in Table 2.Both the magnitudes of the estimates and the standard errors decreased compared with the model described in Table 1.Furthermore, the relatively high variance component estimates ( 2 ˆu σ = 0.903) and odds ratio (OR) confidence intervals reinforce the need to incorporate the random effects.Plants older than eleven years (age 3 ) and at later assessment dates (evaluations 2, 3 and 4) yielded significantly different results.Thus, orchards with trees greater than eleven years old exhibited a lower HLB-disease incidence compared to orchards with 0 to 5 year old (age 1 ) trees (OR = 2.314 with CI (95%) = (1.426;3.755)).Significant reductions in the number of zeros in the data set were observed during the later assessments dates (evaluations 2, 3 and 4), indicating an increase in the number of HLB-diseased plants over time.
In the non-inflated component of the model, the estimates and their standard errors displayed acceptable differences when compared with the ZINB model (Table 1).The variance component estimate ( 2 ˆv σ = 0.434) is relatively small compared with that observed in the zero-inflated component.Both the application of insecticides to trees producing new foliage flushes (management 2 ) and the calendar based approach (management 3 ) exhibited a significantly higher relative risks (RR = 3.304 and RR = 3.187) than orchards receiving insecticide only when the (D. citri) vector was observed in the orchard (management 1 ).Thus, the application of insecticides to trees with new foliage flushes and via the calendar based method correlated with a higher number of diseased plants than in orchards where insecticide is only applied when the vector is detected, with RR = 3.304 and CI (95%) = (2.281;4.785), and RR = 3.187 and CI (95%) = (2.046;4.963), respectively.The highest D. citri vector populations generally occurred during the vegetative flush of young foliage.Increased infection rates are related to the highest populations of the vector, with outbreaks during the spring and summer periods (Yamamoto, Paiva, & Gravena, 2001;Bassanezi et al., 2010).The expected ratio of symptomatic to infected plants displayed high variability throughout the year, but was lower in the autumn and winter and higher in the spring and summer.Based on the current model (Table 2)  The inclusion of random effects in the model reduces the estimate of the dispersion parameter from 1.997 to 0.232.However, these numbers are still significant, suggesting substantial overdispersion.The random effects model estimates exhibit greater precision, as illustrated by the OR values from ages 2 and 3 in the zeroinflated component (Table 1).
The log-likelihood, deviance and Pearson residual results verify that the zero-inflated negative binomial model with random effects in both link functions provides a better fit for the sampled data.
The quantile-quantile plots of the random effects u and v illustrate that the estimates possess a near-normal distribution, which can be partially used to validate the ZINB model with random effects (Figure 2).Furthermore, the half-normal plot, in which the residuals lie within the simulation envelope, suggests that the model provides a good fit, despite the overdispersion effect (Figure 3).

Conclusion
The ZINB model with the random effects provides a more appropriate fit for the HLBdiseased tree incidence over time compared to models that do not incorporate these random effects.Furthermore, we increased the parameter estimate precision by considering intercepts on an orchard by orchard basis.
We were able to detect that age 3 exhibited lower probabilities for contracting HLB-disease compared with trees less than five years old.In addition, the orchards exhibited a higher probability for possessing HLB-diseased trees as the season progressed.The application of management 2 and management 3 led to three times more diseased plants than the insecticide application method based on the presence of the vector.
, the covariables affecting the mean in the ZIP model (the non-inflated component) may or may not be the same factors affecting the probability ( p ;

l
is the log-likelihood function of ij y given conditionally fixed u and v vectors based on the respective link functions (2 and 3), and 2 , the log-likelihood function 1 values of γ , u , β and v , respectively.,u I γ is the second negative derivative of l ξ , which accounts for β and u .,v I β is the second negative derivative of l η , which accounts for β and v .The maximization of the log-likelihood function most recent values of û and v

Figure 2 .
Figure 2. Quantile-quantile plots of the random effects u and v based on the ZINB model with random effects.

Figure 3 .
Figure 3. Half-normal plot for the ZINB model with random effects.

Table 1 .
Estimates, standard errors and confidence intervals (CI 95%) for the odds ratio and relative risk associated with the fitted ZINB model.

Table 2 .
Estimates, standard errors and confidence intervals (CI 95%) for the odds ratio and relative risk associated with the fitted ZINB model with random effects.