Genomic prediction using the lmekin function from the coxme R package

. The increasing use of genomic selection (GS) in plant and animal breeding programs has led to the development of software that fits models based on unique scenarios. Accordingly, several R packages have been developed for GS. The lmekin function from the coxme R package was one of the first functions implemented in R to fit models with random family effects using the pedigree – based relationship matrix. The function allows the user to provide the covariance structures for the random effects; thus, the GBLUP model can be fitted. This fitting process consists of replacing, in the traditional BLUP model, the additive relationship matrix derived from a pedigree by the additive relationship matrix derived from markers. Thus, the objective of this study was to employ the lmekin function in the context of genomic prediction by comparing the results of this function with those obtained using five R packages for GS: rrBLUP, BGLR, sommer, lme4qtl, and lme4GS. The comparisons were performed considering the computational times and predicted values for a wheat dataset and simulated big data. In addition, we implemented a 5-fold cross-validation scheme through considering the values predicted by the lmekin function for the wheat dataset. The results indicated that the lmekin function was effective in predicting genomic breeding values considering multiple random effects and relatively small sample sizes. The rrBLUP package processed the fastest for the scenario with only one genetic random effect, and the high temporal efficiency of the sommer package was confirmed for the scenario with more than one genetic random effect. Differences in computational times occurred because of the different algorithms implemented in the packages to estimate the variance components.


Introduction
The superiority of the genomic selection (GS) method, proposed by Meuwissen, Hayes, and Goddard (2001), in relation to phenotypic selection has been confirmed in several plant and animal breeding programs (Resende, Silva, & Azevedo, 2014;Budhlakoti et al., 2022).This superiority has mainly resulted from a reduction in generation intervals, as young plants can be genotyped early in life and their genomic breeding values are computed for selection purposes, resulting in an improved selection accuracy.Moreover, GS considers the real genetic relationship of individuals instead of the average pedigree-based relationship (Mrode, 2014;Resende et al., 2014).
In genomic prediction models, the number of regressor variables (p), i.e., the number of markers, usually vastly exceeds the number of observations (n), which leads to the large "p" and small "n" problem (p >> n).Thus, GS requires methods that use some type of variable selection or shrinkage estimation procedures (de los Campos, Hickey, Pong-Wong, Daetwyler, & Calus, 2013;Resende et al., 2014;Budhlakoti et al., 2022).Alternatively, a mixed-model methodology can be employed within the context of GS.One of the most widely used methods consists of replacing, in the traditional BLUP model, the pedigree-based relationship matrix (A) with the marker-based relationship matrix (G), resulting in a method called the Genomic Best Linear Unbiased Prediction (GBLUP) (VanRaden, 2008;Yang et al., 2010;Mrode, 2014;Resende et al., 2014).
Genomic selection methods based on mixed models can be generalized to comparatively complex scenarios; for example, the prediction of additive and non-additive genetic effects using relationship information derived from markers and pedigree (Muñoz et al., 2014).Despite all computational advances, there are few open-source statistical software that allow genomic predictions considering multiple random effects (Covarrubias-Pazaran, 2016).
According to Caamal-Pat et al. (2021), the first R packages developed for GS were BLR (de los Campos et al., 2009) and rrBLUP (Endelman, 2011).Motivated by the need to implement various genomic methods using Bayesian regression in a single program, Pérez and de los Campos (2014) developed the BGLR package, which supports models for continuous (censored or non-censored) and categorical (binary or ordinal multinomial) traits.To predict genomic breeding values incorporating more than one variance component in addition to the residual error, Covarrubias-Pazaran (2016) implemented the Sommer R-package, which is based on the Henderson mixed model equations.Other programs have implemented GS methods that utilize free software with an intuitive interface and without command lines (Cruz, 2016;Resende, 2016;Azevedo et al., 2019).
In R software, mixed model analyses have been performed mainly using the nlme (Pinheiro & Bates, 2000) and lme4 (Bates, Mächler, Bolker, & Walker, 2015) packages.However, neither package fits models with random genetic effects.One of the first functions implemented in R to fit such linear models is the lmekin (linear mixed models with kinship) function from the coxme package (Therneau, 2020), which is an extension of the lme function from the nlme R-package.This function has been widely used in human genetic studies but has few applications in plant and animal breeding.
As an extension of the lme4 R-package, to fit mixed models applied to animal breeding, Vazquez, Bates, Rosa, Gianola, and Weigel (2010) developed the pedigreemm R-package, which fits generalized linear models with pedigree-based information.However, the package does not allow the user to provide a variancecovariance matrix for genetic effects, making its use in GS unfeasible.Recently, two R packages have been developed, lme4qtl (Ziyatdinov et al., 2018) andlme4GS (Caamal-Pat et al., 2021), which both represent extensions of the lme4 package and fit GBLUP-like models with multiple random effects.
Unlike the pedigreemm R-package, the lmekin function allows the user to input a covariance matrix for random genetic effects, such as the genomic relationship matrix.Furthermore, the function allows for estimating multiple variance components, making it possible to fit the GBLUP model with multiple random effects.
The sommer package was compared with the rrBLUP and BGLR packages, and with four other R packages: ASReml-R, regress, EMMREML, and MCMCglmm (Covarrubias-Pazaran, 2016).The lme4qtl package was compared with the SOLAR package and the lmekin function, considering a genome-wide association study (GWAS) and not the prediction context in GS (Ziyatdinov et al., 2018).The lme4GS package was compared with the BGLR and sommer packages (Caamal-Pat et al., 2021).However, to the best of our knowledge, no study has compared the computational performances of these packages when fitting GBLUP-type models.Thus, the objective of this study was to employ the lmekin function in the context of genomic prediction by comparing the results of this function with those obtained using five different R packages for GS: rrBLUP, BGLR, sommer, lme4qtl, and lme4GS.

Material and methods
The lmekin function fits the general linear mixed model as follows (Therneau, 2020): where y is the vector of observed values, β is the vector of fixed effects, b ∼ N(0, σ e 2 ()) is the vector of random effects, e ~ N(0, Iσ e 2 ) is the random error vector, and X and Z are incidence matrices for fixed and random effects, respectively.The relative variance matrix D for random effects depends on the vector of the unknown variance parameters θ = (σ 1 2 , . . ., σ l 2 )/σ e 2 .The model also assumes that cov (b, e) = 0, such that y ∼ N(Xβ, V) with V = σ e 2 (()  + ), where σ e 2 is the residual variance and I is an identity matrix.Pinheiro and Bates (2000) presented a form that is more convenient for expressing the relative variance matrix D using a relative precision factor Δ, which represents any matrix that satisfies D -1 = Δ T Δ; Δ is the Cholesky factor of D -1 .
The estimation of model parameters using the lmekin function follows the routines implemented in the lme function, where the profiled log-likelihood function with respect to θ is calculated using orthogonaltriangular decomposition methods for solving augmented least-squares problems of the form (Pinheiro & Bates, 2000): where Q is an orthogonal matrix and R is an upper-triangular matrix.The approach of changing the contribution of the marginal distribution of the random effects into extra rows for the response and incidence matrices (left side of the equality above) is called pseudo-data representation because it creates the effect of the marginal distribution by adding "pseudo" observations.To estimate variance components, the function implements a hybrid optimization scheme.It begins by calculating the initial estimates of the θ parameters, then uses a moderate number of EM iterations and switches to Newton-Raphson iterations (Pinheiro & Bates, 2000;Therneau, 2020).
The model in (1) can be employed in the context of GS, where the random effects correspond to the genomic breeding values (g), with ~(0,   2 ); where   2 is the additive genetic variance, and K is the additive relationship matrix derived from markers (GBLUP).In the case of matrix K derived from the pedigree, the model is the traditional BLUP (Caamal-Pat et al., 2021).
The fit of the GBLUP model using the lmekin function of the coxme package was compared with that of other R packages for GS: rrBLUP (Endelman, 2011), BGLR (Pérez & de los Campos, 2014), sommer (Covarrubias-Pazaran, 2016; Covarrubias-Pazaran, 2018), lme4qtl (Ziyatdinov et al., 2018) and lme4GS (Caamal-Pat et al., 2021).Except for BGLR, all other packages are based on a frequentist approach, where the variance components are estimated by maximum likelihood (ML) or restricted maximum likelihood (REML).The difference between the methods is that REML produces estimates of the variance components that are unbiased because it removes the fixed effects from the model before estimating the variance components.By comparison, in the ML method, the fixed effects are estimated without considering the loss of degrees of freedom due to the estimation of these effects, thus causing bias.
In the BGLR package, the GBLUP model is fitted via Bayesian inference using the Gibbs sampler and the Reproducing Kernel Hilbert space [RKHS] kernel to specify the variance-covariance structures of the genomic breeding values (Pérez & de los Campos, 2014).We considered 30,000 iterations for the Gibbs sampler, with the first 5,000 iterations discarded as burn-in (Caamal-Pat et al., 2021;Crossa et al., 2010).The correct number of iterations and burn-in length are defined when evaluating the convergence of the chains; however, this is beyond the scope of this work.The BGLR assigns scaled-inverse chi-square densities to the variance components, which are indexed by a scale and degree-of-freedom parameter.By default, the degrees of freedom were set equal to five, and the scale parameter was based on the sample variance of the phenotypes.Further details are available in Pérez and de los Campos (2014).In the frequentist analysis, models were fitted using the default parameters of each package.

Applications
To compare the fit of the GBLUP model using different R packages, we used two datasets: a wheat dataset (commonly used in R packages for GS), and a larger dataset simulated by Covarrubias-Pazaran (2016).The wheat dataset consists of a population of 599 wheat lines genotyped using 1447 Diversity Array Technology (DArT) markers, coded as 0 (absence) and 1 (presence).Markers with a minor allele frequency lower than 5% were removed, and missing genotypes were imputed as   ~(̂), where ̂ is the estimated allele frequency computed from the non-missing genotypes.Following this process, 1279 markers were retained for analysis.The phenotypic trait was the grain yield (2-year average) of the 599 wheat lines evaluated in four environments (Crossa et al., 2010;Pérez & de los Campos, 2014).For this study, we considered only the phenotypes for environment one (low rainfall and irrigation), which were standardized to a sample variance equal to one.Data with phenotypic values (Y), the matrix of markers (X), and the additive relationship matrix computed from a pedigree (A) are publicly available with the BGLR package.The GBLUP model was fitted as follows (VanRaden, 2008;Yang et al., 2010;Mrode, 2014;Resende et al., 2014): where y is the response vector for the grain yield variable (environment 1), 1 is a vector of ones, and μ is an intercept.Z is an incidence matrix for the genomic breeding values (u), assumed as ~(0,   2 ), where   2 is the additive genetic variance associated with the markers.G is the genomic relationship matrix, given by G=XX T /m, where X is the matrix of markers centered and standardized (i.e., each marker was centered by subtracting the mean and standardized by dividing by the sample standard deviation), and m is the number of markers (Caamal-Pat et al., 2021).The random error vector was assumed to be in Model (1).To ensure that G is a positive definite matrix, the same can be obtained by  * =  + 10 −6 , where I is an identity matrix (Resende et al., 2014).Heritability, given by ℎ 2 =   2 /(  2 +   2 ), measures the extent to which phenotypic variance is explained by additive genetic effects.Model (3) can be extended by jointly considering the marker and pedigree information.Thus, the following model can be fitted as follows (Crossa et al., 2010;Mrode, 2014;Resende et al., 2014;Caamal-Pat et al., 2021): where T is the incidence matrix for additive polygenic effects (a), assumed as ~(0,   2 );   2 is the additive genetic variance associated with the pedigree, and A is an additive relationship matrix derived from the pedigree, whose entries are equal to twice the kinship coefficient between pairs of lines (Crossa et al., 2010).The other terms of the model in (4) are the same as those in (3), and it is assumed that the random effects of the model (u, a, and e) are independent.Heritability was then calculated as ℎ 2 =   2 /(  2 +   2 ), with   2 = (  2 +   2 ).Unlike other R packages considered in the analysis, the lmekin function fits linear mixed models with multiple genetic effects as a linear combination of variance components (Zhao & Luan, 2012;Therneau, 2020).Thus, the function fits the model in (4) as follows: where Wg=Zu+Ta and W is the incidence matrix for the total genetic effects (g = u + a), assumed to be g~N(0,), where  =   2  +   2 , with   2 ,   2 , G and A previously defined for models (3) and ( 4).The equivalence between the models in ( 4) and ( 5) results from the assumption of independence between genetic effects.The individual components of g can be obtained as u ̂=σ u 2 GΣ -1 g ̂ and a ̂=σ a 2 AΣ -1 g ̂ (Mrode, 2014).
The REML estimates of variance components and the BLUP values obtained using the lmekin function were compared with the values obtained using the other R packages for both genomic models with and without pedigree information.The computing times required to fit the models using the different packages were also recorded.
We additionally implemented a k-fold cross-validation scheme based on the ranefUvcovNew function of the lme4GS package (Caamal-Pat et al., 2021) and incorporating the predicted values using the lmekin function.To verify that the cross-validation scheme was correctly employed, the results were compared to those obtained using the ranefUvcovNew function in the lme4GS package.The wheat dataset was divided into five disjoint groups (k = 5, five-fold) of approximately equal size.Each of these fifths acted as a validation set and the remainder acted as a training set.GBLUP models with and without pedigree information were fitted using the training set, and the phenotypes for the validation set were predicted.The Pearson correlation between the observed and predicted values for individuals in the validation set and the mean squared error (MSE) were measured.This procedure was repeated five times for each model (with and without pedigree information).Each time, a different group of observations was treated as a validation set, resulting in five estimates of correlation and MSE.The accuracies of the prediction and MSE were computed by averaging these values.In cross-validation, the number of folds is usually fixed at 5 or 10 (James, Witten, Hastie, & Tibshirani, 2013;Caamal-Pat et al., 2021).
In the comparison among R packages, we also considered a big data scenario with phenotypic and genotypic data simulated by Covarrubias-Pazaran (2016) for 5,000 individuals with 10,000 markers for a single trait and a single component of variance, with heritability h 2 = 0,5.The fitted model was the same as in (3), whose genomic relationship matrix was calculated using the A.mat function of the rrBLUP package, and the variance components were estimated by REML.Thus, we compared the computing time of the lmekin function with the other R packages to fit the GBLUP model in a big data scenario.All analyses were performed using R software, version 4.1.3(R Core Team, 2022), and a PC with a 3.6 GHz Intel Core i3 processor and 8 GB RAM memory.The R codes used to recreate the analyses are available upon request.

Example 1: Wheat dataset (Markers and Markers + Pedigree)
The results of the BLUPs values, variance component estimates, and computational times using six different R packages are presented in Tables 1 and 2 for the entire dataset.In both fitted models (only markers and markers + pedigree), the estimates obtained using the lmekin function were identical to those obtained using the other R-packages with the frequentist approach (rrBLUP, sommer, lme4qtl, and lme4GS), as shown in Tables 1 and 2. Thus, the lmekin function can also be used in genomic prediction studies.Slight differences among the estimates obtained using the BGLR package and those obtained by the other frequentist packages were expected.This is because the GBLUP model is fitted in the BGLR package via Bayesian inference using Gibbs sampling (Pérez & de los Campos, 2014) whereas the other packages fit the model via frequentist inference using the G-REML/G-BLUP procedure.In Bayesian analysis, a higher number of iterations is required to achieve the same parameters than in the frequentist approach.
The rrBLUP package fit the model with only one random effect (marker) in the shortest time (1.69 s), followed by the sommer (4.43 s), and the longest was the BGLR package (57.15 s).The lme4qtl and lme4GS packages, both extensions of the lme4 package (Bates et al., 2015), presented similar computational times (5.82 s and 6.00 s, respectively).The lmekin function, an extension of the lme function (Therneau, 2020), when compared to the lme4 counterpart packages (lme4qtl and lme4GS), took twice as long to fit the same model (Table 1).
For the model with two genetic random effects (Table 2), the lmekin function (22.37 s) was faster than the lme4qtl (39.78 s) and lme4GS (40.02 s) packages, fitting the model in approximately half the time.In the presence of genetic random effects with more than one covariance matrix, the lmekin function differs from other R packages in that it provides a single vector of BLUPs as given by the sum of the genetic effects (Therneau, 2020).
The sommer package had the lowest computational time (only 5.47 s) among the employed packages, at four times faster than the lmekin function, seven times faster than the lme4qtl and lme4GS packages, and 18 times faster than the BGLR (102.35 s).A similar result was found by Caamal-Pat et al. (2021), who compared the computational times of the lme4GS, sommer, and BGLR packages.The speed of sommer is due to the methods implemented in the package to estimate variance components, which combine the direct-inversion Newton-Raphson (NR) or Average Information (AI) algorithms with the auto-decomposition technique of the relationship matrix.The rrBLUP package was not employed because it is limited to a single variance component, in addition to the error (Covarrubias-Pazaran, 2016;Caamal-Pat et al., 2021).
The inclusion of pedigree information reduced the residual variance from 0.53 to 0.43 in the frequentist approach (sommer, lme4qtl, and lme4GS packages and the lmekin function) and from 0.54 to 0.44 in the bayesian approach (BGLR package) (Tables 1 and 2).Once the phenotypes are standardized, the residual variance estimate (σ e 2 ) can be used to assess the goodness of fit of the models because it indicates the extent to which the phenotypic variance is not explained by the model (Crossa et al., 2010).Consequently, the Acta Scientiarum.Agronomy, v. 46, e64243, 2024 heritability increased from 0.5 (model with only marker information) to 0.6 (model with marker and pedigree information), with approximately 69 and 65% of the total genetic variance being explained by the markers in the frequentist and bayesian approaches, respectively.

Example 2: Five-fold cross-validation using the wheat dataset
The prediction accuracy values obtained via cross-validation using the lmekin function were identical to those obtained using the ranefUvcovNew function of the lme4GS package, indicating that the cross-validation scheme was correctly employed.Computing times for cross-validation were also close (116.16seconds for lme4GS and 109.66 seconds for lmekin in the model including marker and pedigree information).The inclusion of pedigree information in the model increased the prediction accuracy of genomic breeding values (Table 3).Lower mean square error values were also observed for the model based on the markers and pedigree.These results are in agreement with those obtained by de los Campos et al. (2009) and Crossa et al. (2010) for the same dataset.The inclusion of pedigree information in genomic prediction models is advantageous in situations with a low number of markers (the case of the 599 wheat lines that were genotyped using only 1279 DART markers).This is because, as the number of markers increases, the relative contribution of the pedigree information tends to decrease (Crossa et al., 2010;de los Campos et al., 2013).The reason for including pedigree information was to show that the lmekin function could be employed to fit GBLUP-type models incorporating multiple genetic effects such as additive, dominance, and epistatic, based on marker and pedigree information.

Example 3: Big data
A dataset simulated by Covarrubias-Pazaran (2016) with 5,000 individuals genotyped with 10,000 markers was used to compare the performance of the lmekin function with its counterparts in a big data scenario with a single random effect.We did not consider big data with more than one random effect owing to the unavailability of a PC, as it would require a system with at least 16 GB of RAM memory (Covarrubias-Pazaran, 2016).The results of the computational time are shown in Figure 1.The lmekin function (95.63 min.)as well as the lme4qtl (79.17 min.)and lme4GS (82.8 min.)packages presented the longest execution times, with the lmekin function being approximately 21 and 15% slower than the respective packages.Similar to the wheat dataset, the lme4qtl and lme4GS packages performed similarly.All three programs (lmekin, lme4qtl, and lme4GS) are based on sparse matrix methods, resulting in longer computation times because the genomic relationship matrix is not necessarily sparse.Thus, such packages are expected to have superior computational performance for data with a sparse structure of covariance matrices, such as family based studies (Ziyatdinov et al., 2018).
Among the R-packages compared, rrBLUP presented the lowest computational time (15.01 min.),followed by sommer (47.62 min.),and BGLR (53.7 min), with the last two presenting similar times.These results differed from those obtained by Covarrubias-Pazaran (2016), in that when comparing the packages sommer, ASReml, rrBLUP, regress, BGLR, MCMCglmm, and EMREML, less computational time elapsed using the package sommer for this same dataset.These differences occurred because of the algorithms used.In this study, we used the default parameters of the package, which is based on the direct-inversion NR algorithm, whereas Covarrubias-Pazaran (2016) used the direct inversion AI algorithm; both algorithms are combined with the eigen-decomposition technique of the genomic relationship matrix (Covarrubias-Pazaran, 2016).The rrBLUP and sommer packages implement numerical methods that improve time efficiency when fitting genomic models.In the rrBLUP package, the algorithm is based on the spectral decomposition of kernel ZKZ T , where Z is the incidence matrix and K is the covariance matrix for random effects (Endelman, 2011).
The main features of the packages used in this study are presented in Table 4. Similar comparisons were made by Covarrubias-Pazaran (2016) and Ziyatdinov et al. (2018).The lmekin function, like the other R packages, excepting the rrBLUP package, allows for predictions of multiple random effects, such as additive and non-additive effects.Furthermore, the lmekin function supports different structures of residual variance.In the same package (coxme) in which the function is embedded, the coxme function is also available, which allows fitting of models for censored responses (Cox model) in the same way as the linear model, i.e., using a kinship matrix (Therneau, 2020).Another R package that supports censored responses is BGLR, which treats censored data as missing and sampled from truncated normal densities.BGLR also supports categorical responses (binary or ordinal multinomial) as well as the lme4qtl package (Pérez & de los Campos, 2014;Ziyatdinov et al., 2018).Several other statistical software packages have been developed for GS, including several other packages available in the free software R (Budhlakoti et al., 2022).This is a result of the increasing use of GS in plant and animal breeding programs worldwide, which also motivated the use of the lmekin function as an alternative for genomic prediction.As shown, the function has limitations for large datasets; however, it is reliable in the output of the results and efficient in fitting genomic prediction models with multiple random effects.In addition, the methodology can be extended to the context of survival analysis (Santos et al., 2016), by fitting the mixed-effects Cox model in the same manner as the GBLUP model, thereby replacing the pedigree-based relationship matrix with the genomic relationship matrix.

Conclusion
The lmekin function was effective in genomic predictions incorporating multiple random effects and relatively small sample sizes.Differences in computational times occurred because of the different algorithms implemented in the packages to estimate the variance components.The rrBLUP package was the fastest for the scenario with only one genetic random effect, and the time efficiency of the sommer package was confirmed for the scenario with more than one genetic random effect.The lme4qtl and lme4GS packages, both extensions of the lme4 package, presented similar computational performance rates.

Figure 1 .
Figure 1.Elapsed times to fit the GBLUP model using different R packages for a simulated dataset with 5,000 individuals and a genetic random effect.

Table 1 .
Summary measures of the BLUP values, variance component estimates, and computational times using six different R packages for a model with only marker information.

Table 2 .
Summary measures of the BLUP values, variance component estimates, and computational times using five different R packages for the model with marker and pedigree information.

Table 3 .
Results of cross-validation (5-fold) using the lmekin function for a set of 599 wheat lines considering only markers and marker + pedigree information.

Table 4 .
Comparisons among R-packages used that implement mixed models with genetic random effects.