Quantile regression for genomic selection of growth curves

. This study evaluated the efficiency of genome-wide selection (GWS) based on regularized quantile regression (RQR) to obtain genomic growth curves based on genomic estimated breeding values (GEBV) of individuals with different probability distributions. The data were simulated and composed of 2,025 individuals from two generations and 435 markers randomly distributed across five chromosomes. The simulated phenotypes presented symmetrical, skewed, positive, and negative distributions. Data were analyzed using RQR considering nine quantiles (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9) and traditional methods of genomic selection (specifically, RR-BLUP, BLASSO, BayesA, and BayesB). In general, RQR-based estimation of the GEBV was efficient — at least for a quantile model, the results obtained were more accurate than those obtained by the other evaluated methodologies. Specifically, in the symmetrical-distribution scenario, the highest accuracy values were obtained for the parameters with the models RQR0.4, RQR0.3, and RQR0.4. For positive skewness, the models RQR0.2, RQR0.3, and RQR0.1 presented higher accuracy values, whereas for negative skewness, the best model was RQR0.9. Finally, the GEBV vectors obtained by RQR facilitated the construction of genomic growth curves at different levels of interest (quantiles), illustrating the weight – age relationship.


Introduction
Growth curves help summarize the weight-age (or yield-time) relationship in animals and plants based on a few interpretable parameters, such as mature weight (asymptotic weight) and maturity rate (growth rate).Growth-curve models aim to explain the growth process over time.This information regarding the demand and care needed for each plant or animal development stage facilitates effective economic decision-making.
Pong-Wong and Hadjipavlou (2010) used a two-step method to analyze growth trajectories using a Genomic Selection (GS) approach in the current post-genomic era.In this framework, growth models fit weight-age (or yield-time) data in the first step, whereas growth curve parameter estimates are used as the dependent variables in GS methods in the second step.Ibáñez-Escriche and Blasco (2011) showed that GS modified the growth curve with markers acting simultaneously on the three parameters of the Gompertz curve.Using a genome-wide association study (GWAS) framework, Howard et al. (2015) performed a study based on polynomial coefficients to identify genomic regions affecting growth and feed intake curves in Duroc boars.Silva et al. (2017) identified GWAS-based candidate genes, whose biological functions can be useful in explaining the genetic basis of postnatal growth in pigs.
These genomic methods focus on estimating single nucleotide polymorphism (SNP) marker effects in the mean body weight (BW) over time; the function is defined for the expected value of BW conditional on X (age), or simply E(BW|X).Mosteller and Tukey (1977) showed that regression curves can be estimated for different quantiles of the response variable distribution, providing a complete picture of the regression (Cade & Noon, 2003).This method, called Quantile Regression (QR) (Koenker & Bassett Jr., 1978), can estimate parameters for all portions of the probability distribution of the response variable (e.g., BW).This enabled us to better understand the relationship between response (BW) and explanatory variables (age).In addition, QR does not require assumptions regarding the error distribution and is robust to outliers (Oliveira et al., 2021b).
To deal with high-dimensional problems, such as a large number of parameters and a small number of observations, methods combining shrinkage estimation and variable selection (BLASSO, BayesA, BayesB, etc.) have been proposed for GS.Under a QR framework, this method is called regularized QR (RQR) (Li & Zhu, 2008).Nascimento et al. (2017) proposed the use of RQR in GS studies was proposed by Nascimento et al. (2017) using simulated data.This approach has been used in plant and animal breeding studies and has shown to be a promising technique.Nascimento et al. (2019a) estimated the Genomic Estimated Breeding Values (GEBV) for different levels (quantiles) of the probability distribution associated with flowering time in common bean.Oliveira et al. (2021b) evaluated the efficiency of RQR throughout the breeding process using simulated data from autogamous plants.Barroso et al. (2017) used the same methodology to estimate the effects of SNP markers on growth curves in pigs.
In this study, we propose an RQR methodology for the genomic prediction of growth curves using simulated data with distributions with different levels of asymmetries and compare its prediction accuracy values with those of traditional GS methods (namely, RR-BLUP, BLASSO, BayesA, and BayesB).

Simulated population
Publicly available simulated data from the QTLMAS2009 (Coster, Bastiaansen, Calus, van Arendonk, & Bovenhuis, 2010) were used in this study.The dataset consisted of 2,025 individuals from two generations with complete information on 453 SNP markers that were randomly distributed over five chromosomes.Individuals in this dataset consisted of 25 parents (20 females and 5 ma les) and 2,000 progenies from 100 full-sib families (a combination of male and female parents).Each family included 20 full-sibs.
Of the 100 families, 50 (1,000 individuals) had phenotypic records (BW) generated at five time points (t= 0, 132, 265, 397, and 530 days), according to the logistic growth curve: where  1 is the asymptotic weight;  2 is the inflection point of the curve; and  3 is the slope of the curve.

Genomic breeding values and skewness
True genomic breeding values (TGBVs) for each curve parameter were simulated for each individual as the sum of additive effects based on six quantitative trait locus).One large QTL and five small QTLs were assigned to each of the three curve parameters, assuming the same heritability (h 2 = 0.50) (Coster et al., 2010).The original dataset was modified by inserting positive and negative skewness to evaluate the impact of skewness (non-normal phenotypic distributions) on the genomic prediction accuracy of growth curves.We also evaluated normal phenotypic distribution (the original simulated dataset).Random errors were added to the TGBVs based on three

Estimation of growth curve parameters
A logistic nonlinear regression model (model 1) was used to estimate the growth curve parameters based on individual weight-age data.The estimated growth curve parameters ( 1 : asymptotic weight,  2 : inflection point of the curve, and  3 : slope of the curve) were used as phenotypes in subsequent analyses.

Genomic prediction analyses
Genomic prediction (GP) analyses were performed using the data from all individuals with phenotypic and genotypic records.Different methods were used in this study.

Traditional GS methods
The methods used in this study include BLASSO (De Los Campos et al., 2009), RR-BLUP, BayesA, and BayesB (Meuwissen, Hayes, & Goddard, 2001).These methods assume normal distributions for the phenotype values.Bayesian methods were implemented using 100,000 MCMC (Markov chain Monte Carlo) iterations, with a burn-in and thin layer of 10,000 and five iterations, respectively.
Regularized quantile regression.Data were analyzed using regularized quantile regression (RQR) (Li & Zhu, 2008) based on nine quantiles (τ): 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9.This method consists of obtaining the vector of marker effect estimates (τ) that solves the following optimization problem: where τ ∈ ]0,1[ indicates the quantile, µ is the mean, x ik is the incidence of the ith individual and the kth marker, τ k is the effect of the ktb marker, is the sum of the absolute values of the regression coefficients, and λ is the regularization parameter.The parameter ρτ(.) is denoted as a check function (Koenker & Bassett Jr., 1978), and is defined by Thus, the values of τ k represent marker effects in the τ th quantile of interest.A grid of shrinkage parameter values () ranging from 0 to the posterior  estimate from BLASSO is used.However, we reported results that yielded the highest prediction accuracy.
The genomic estimated breeding values (GEBVs) were obtained for each quantile as , where τ represents the τ th quantile of interest.Subsequently, the mean genomic growth curve for the ith animal and the τ th quantile was estimated as where ŷ (τ)ij is the predicted BW for animal i at age j (t ij ) and quantile τ; μ ̂(τ)α ̂1, μ ̂(τ)α ̂2 and μ ̂(τ)α ̂3 are the adjusted trait means (parameter estimates for the logistic model) for  1 ,  2 and  3 , respectively.Mean genomic growth curves were also obtained using other methods evaluated in this study (BLASSO, RR-BLUP, BayesA, and BayesB).
In addition, for each genomic selection method and scenario (symmetric, positive, and negative skewness), the slope of the regression of GEBVs on TGBVs was calculated as a measure of prediction bias.

Comparison of methodologies under a GS approach
Genomic prediction analyses were performed using a two-fold cross-validation approach, with each fold defined using Ward's hierarchical clustering method (Ward Jr., 1963) based on genotypes.The accuracy of genomic prediction for the 13 different genomic selection models (BLASSO, RR-BLUP, BayesA, BayesB, RQR 0.1 , RQR 0.2 , …, RQR 0.9 ) was calculated as the average (from each fold) Pearson correlation coefficient between the pre-adjusted phenotypes and GEBVs divided by the square root of heritability.Subsequently, bar graphs were generated using accuracy values and their respective standard errors.
In addition, Cohen's kappa coefficient (Cohen, 1960) was used to calculate the percentage of the top 10% of individuals with the highest GEBVs across 13 different genomic selection models (BLASSO, RR-BLUP, BayesA, BayesB, RQR 0.1 , RQR 0.2 , …, RQR 0.9 ).Cohen's Kappa coefficient is given by, C=NC-C Random /(1-C Random ), where NC is the relative observed agreement among methodologies, and C Random is the hypothetical probability of random agreement.

Computational features
Data analysis was carried out on a computer with a 2.60GHz core i7 processor and 16GB of RAM.Analyses were performed using the functions nls (nonlinear logistic model fitting), rq (regularized quantile regression), and BGLR (Bayesian LASSO, RR-BLUP, BayesA, and B) (de Los Campos & Pérez-Rodríguez, 2014) from the R package (R Core Team, 2021) statistics, quantreg (Koenker, 2015) and BLGR (Pérez, de Los Campos, Crossa, & Gianola, 2010).The plausibility of these values was assessed separately for each MCMC chain using the Raftery-Lewis and Geweke convergence diagnostics in the boa (Smith, 2007) R package.All data and R codes used in this study are freely accessible on the Web: https://github.com/licaeufv.

Results
Figures 1, 2, and 3 show the distributions of phenotypic and genotypic (Breeding Values -BV) values for all parameters  1 -mature weight,  2 -inflection point, and  3 -slope considering symmetric, positive, and negative skewness phenotypic distributions, respectively.As expected, compared with the distributions of the genotypic values, the distributions of the phenotypic values presented larger ranges for all the simulated scenarios (Figures 1, 2, and 3).The distributions of genotypic values were symmetrical for all scenarios and parameters.These distributions were concentrated in the middle, lower, and high quantiles of phenotypic distributions for the symmetric, positive, and negative skewness scenarios, respectively (Figures 1, 2, and 3).
Acta Scientiarum.Agronomy, v. 46, e65081, 2024 Although evidence of significant bias was detected (Table 1), considering the mean genomic growth curves (Figures 4, 5, and 6), RQR showed the least bias compared to BLASSO, RR-BLUP, BayesA, and BayesB for all parameters and scenarios.When the distribution of phenotypic values is symmetric, RQR 0.5 and RQR best fits are the nearest curves obtained using the breeding values for each parameter of the logistic growth model (Figure 4).For the positive and negative scenarios, RQR best yielded estimates with the least bias (Figures 5 and 6).In general, when compared to the mean curve based on true values, the curve obtained by RQR best outperformed those from the other models (BLASSO, RR-BLUP, BayesA, BayesB) (Figures 4, 5, and 6).
In general, the slopes between breeding values and GEBV of all models and scenarios were significantly different from the unit (p < 0.01), indicating a significant bias in the prediction.The RQR best was not significantly different from that in the positive skewness scenario.Although evidence of significant bias was detected, the slope values derived from the RQR best were slightly lower for all traits and scenarios (Table 1).Spearman's correlation (upper diagonal) and Cohen's Kappa concordance coefficients (lower diagonal) between GEBVs were obtained using six different genomic selection models (BLASSO, RR-BLUP, BayesA, BayesB, RQR 0.5 , and RQR best ) considering the symmetric, positive, and negative skewness phenotypic distribution scenarios, as shown in Tables 2, 3, and 4.
In general, except for the RQR 0.5 model in scenario 1, Spearman's correlations varied from moderate to high positive values.The lowest Spearman's correlation was observed between the RR-BLUP and RQR best models (0.52) in the negative-skewness scenario.The highest Spearman's correlation coefficient was observed for BayesA and BayesB (1.00) in the symmetric scenario (Tables 2, 3, and 4).

Discussion
We proposed genomic selection (GS) for growth curves based on RQR to obtain curves for different parts (quantiles) of the BW distribution in the presence of skew.The use of RQR to estimate genomic breeding value was efficient because, at least for one quantile model, the accuracy values presented better results when compared to those obtained from BLASSO, RR-BLUP, BayesA, and BayesB for all scenarios (Figures 1, 2, and 3).These results are reasonable because unlike traditional methods based on conditional expectations, E(Y|X), RQR allows fitting regression models on different parts of the distribution of the variable response, enabling a complete understanding of the phenomenon under study (Barroso et al., 2017;Cade & Noon, 2003;Koenker & Bassett Jr., 1978;Nascimento et al., 2017;Oliveira et al., 2021b).Therefore, it seems possible to find the best model to represent the relationship between the dependent (phenotype) and independent (marker effects) variables, thereby increasing the predictive performance of the model.
For the symmetric scenario, the best models for predicting  1 (asymptotic weight),  2 (inflection point), and  3 (slope of the curve) genetic values were RQR 0.4 , RQR 0.3 , and RQR 0.4 , respectively.These results make sense because the best-regularized quantiles are around the center of the distribution of the phenotypic values, which, in symmetrical situations, concentrate on the major mass of probability.As expected, under the skewness situation, the RQR models with lower (RQR 0.2 , RQR 0.3 , and RQR 0.1 ) and higher (RQR 0.9 ) quantiles presented better results for predicting GBVs for positive and negative skewness phenotypic distribution scenarios, respectively.In these situations, the mass of probability is concentrated on the lower and higher quantiles for positive and negative skewness phenotypic distributions, respectively.Specifically, as with several traits, growth curve parameters can present different skewness levels, and RQR fit can improve model accuracy.In these cases, a functional relationship defined as higher (> 0.50) or lower quantiles (< 0.50) can improve GWS studies and subsequently improve the selection process of individuals in breeding programs.However, due to the infinite number of quantiles in RQR, finding the "best" one to explain the functional relationship is still a challenge.
In general, all models presented moderate to high positive Spearman correlations (Tables 1, 2, and 3).However, considering the agreement of Cohen's kappa coefficient, the classification agreement between RQR fit models and non-quantile models (RR-BLUP, BLASSO, BayesA, and BayesB) varied from moderate (0.60-0.79) to minimal (0.21-0.39) (McHugh, 2012).This result suggests differences between the quantile and nonquantile model classifications.The difference between the results of Spearman's correlations and Cohen's kappa coefficient occurs because kappa coefficients consider the possibility of random concordance.
Altogether, these concordance results indicate that using quantile regression to obtain curves on different parts of the BW distribution and even combining this information to set an RQR best model is an interesting and promising approach.In addition, the advantages of RQR are combined with those of the two-step approach.The two-step approach allows obtaining the GEBVs for any time in the observed range, ranking the animals using the GEBVs values directly for the parameter estimates and by the GEBVs of the evaluated trait.
Nonlinear QR to describe growth curves in plant breeding programs has already been used to evaluate dry matter accumulation in garlic plants by Puiatti et al. (2018;2020), and the length and width of the fruit of pepper genotypes by Oliveira et al. (2021a).In all of these studies, nonlinear QR was efficient in fitting models at different levels and classifying genotypes.In the context of genomic selection, RQR has already been successfully applied by Nascimento et al. (2019a) to predict the genetic value of individuals in traits associated with the flowering time of the common bean, showing a very promising technique for the traditional techniques of genomic selection.
In animal breeding, nonlinear RQ has been successfully used to fit the lactation curve of dairy cows and growth curves in pigs at different established quantiles (Younesi, Shariati, Zerehdaran, Nooghabi, & Løvendahl, 2019;Nascimento et al., 2019b).In genomic data, Barroso et al. (2017), using the RQR, built genomic growth curves using RQR, which allowed the identification of genetically superior individuals in relation to growth efficiency.Furthermore, RQR enables us to find, in different quantiles, the most relevant markers for each trait evaluated and their respective chromosomal positions.
RQR is a promising and efficient technique in plant and animal breeding.However, more studies using different sizes of datasets (individuals and markers) are needed to address the efficiency of RQR.Other issues are related to the shrinkage parameter, which can be defined using a grid of values, cross-validation, or a Bayesian approach (Alhamzawi, Yu, & Benoit, 2012) and the definition of the best quantile fit model considering different levels of skewness.

Conclusion
The proposed model based on Quantile regression (QR) provided more accurate values than BLASSO, RR-BLUP, BayesA, and BayesB for all the simulated phenotypes with different skewness levels.The GEBV vectors obtained by RQR enabled the construction of genomic growth curves at different levels of interest (quantiles), comprehensively revealing the weight-age relationship.
Thus, the distribution of phenotypic values (growth curve parameters) was simulated as symmetric, positively skewed, and negatively skewed.The modified datasets are freely accessible at https://github.com/licaeufv.

Figure 4 .
Figure 4. Mean genomic growth curves using the genomic estimated breeding values (GEBVs) obtained by 6 different genomic selection models (BLASSO, RR-BLUP, BayesA, BayesB, RQR0.5, and RQRbest) and the breeding values for each parameter of a logistic growth curve considering the symmetric scenario.

Figure 5 .
Figure 5. Mean genomic growth curves using the genomic estimated breeding values (GEBVs) obtained by 6 different genomic selection models (BLASSO, RR-BLUP, BayesA, BayesB, RQR0.5, and RQRbest) and the breeding values for each parameter of a logistic growth curve considering the positive skewness scenario.

Figure 6 .
Figure 6.Mean genomic growth curves using the genomic estimated breeding values (GEBVs) obtained by 6 different genomic selection models (BLASSO, RR-BLUP, BayesA, BayesB, RQR0.5, and RQRbest) and the breeding values for each parameter of a logistic growth curve considering the negative skewness scenario.