Comparison of whole genome prediction accuracy across generations using parametric and semi parametric methods

Accuracy of genomic prediction was compared using three parametric and semi parametric methods, including BayesA, Bayesian LASSO and Reproducing kernel Hilbert spaces regression under various levels of heritability (0.15, 0.3 and 0.45), different number of markers (500, 750 and 1000) and generation intervals of validating set. A historical population of 1000 individuals with equal sex ratio was simulated for 100 generations at constant size. It followed by 100 extra generations of gradually reducing size down to 500 individuals in generation 200. Individuals of generation 200 were mated randomly for 10 more generations applying litter size of 5 to expand the historical generation. Finally, 50 males and 500 females chosen from generation 210 were randomly mated to generate 10 more generations of recent population. Individuals born in generation 211 considered as the training set while the validation set was composed of individuals either from generations 213, 215 or 217. The genome comprised one chromosome of 100 cM length carrying 50 QTLs. There was no significant difference between accuracy of investigated methods (p > 0.05) but among three methods, the highest mean accuracy (0.659) was observed for BayesA. By increasing the heritability, the average genomic accuracy increased from 0.53 to 0.75 (p < 0.05). The number of SNPs affected the accuracy and accuracies increased as number of SNPs increased; therefore, the highest accuracy was for the case number of SNPs=1000. With getting away from validating set, the accuracies decreased and the most severe decay observed in the case of low heritability. Decreasing the accuracy across generations affected by marker density but was independent from investigated methods.


Introduction
Animal breeding aims to improve economic efficiency of livestock species through selection under a changing cost and income scenario.Most of the economic important traits in livestock have a polygenic and quantitative expression, i.e., controlled by a large number of genes and affected by environmental factors.Statistical analysis of phenotypes and pedigree information allows prediction of the breeding values of the selection Acta Scientiarum.Animal Sciences Maringá, v. 38, 4, p. 447-453, Oct-Dec., 2016 candidates based on Fisher's infinitesimal model (Fisher, 1919).The best linear unbiased prediction (BLUP) (Henderson, 1974) has been extensively used for estimating the breeding values in animal and plant breeding programs and it leads to great progress in economic important traits.Advances in the molecular genetics technologies have greatly increased the information available on individual genes for quantitative traits.Molecular techniques have enabled researchers to identify genetic markers that can be used to precise selection of candidates and thus increase efficiency of breeding programs.Genomic selection is a new tool for selection of animals based on marker information in addition to traditional information (Meuwissen, Hayes, & Goddard, 2001).The implementation of genomic selection takes place in two steps.Firstly, the effects of genetic markers, typically single nucleotide polymorphism (SNP), are estimated in a reference population (animals with both marker and phenotypic information).Secondly, the estimated effects of the genetic markers are used to predict genomic breeding values for individuals of validating population (animals with marker information and without phenotypic information).Meuwissen et al. (2001) suggested using genetic marker information in a statistical model of prediction of animal breeding values.They used three statistical models: a model assigning random effects to all available genetic markers ("genomic BLUP"), assuming that all markers effects are drawn from the same normal distribution, and two Bayesian models, where all ("BayesA") or a subset ("BayesB") of the random marker effects are drawn from distributions with different variances.Various types of these methods and additional methods have been subsequently suggested (Gianola, Campos, Hill, Manfredi, & Fernando, 2009).Gianola, Fernando, and Stella (2006) and Gianola and van Kaam (2008) have suggested a non-parametric treatment of genomic information by using Reproducing Kernel Hilbert Spaces (RKHS) regression, that subsequently have demonstrated with real data (González-Recio et al., 2008;González-Recio, Gianola, Rosa, Weigel, & Kranis, 2009).
Genomic selection provides a greater genetic progress in comparison with the traditional methods by increasing the accuracy of estimated breeding values and reducing generation intervals.The accuracy of genomic evaluation depends, among other factors, on the linkage disequilibrium (LD) between SNPs (Calus, De Roos, & Veerkamp, 2008).
LD is defined as the non-random association between the alleles at two different loci.LD can be caused by some factors including migration, mutation, selection or genetic drift in small populations, or any other event that may affect the genetic structure of the population.Population structure affects significantly the accuracy of genomic predictions when the selection candidates are closely related to the reference population (Habier, Fernando, & Dekkers, 2007;Habier, Tetens, Seefried, Lichtner, & Thaller, 2010) so that closer relatedness between subgroups (i.e., more recent divergence) increases LD between subgroups that leads to accurate genomic predictions (Daetwyler, Kemper, Van der Werf, & Hayes, 2012).
However, LD is decreasing due to recombination events in the meiosis before the development of the gametes of each new generation (Habier, Fernando, & Dekkers, 2009).To have sufficient reliability of genomic predictions, new genotyped and phenotyped individuals should be contributed in the reference population.Because of recombination events and decreasing the relatedness between subgroups, estimation of the marker effects should be re-evaluated at least every three generations (Hayes, 2007).
In this study, we investigated effect of marker density, trait heritability and decreasing the relatedness between training and validating population on prediction accuracy.

Simulation
The populations were simulated using the QMSim software (Sargolzaei & Schenkel, 2009) based on forward-in-time process.To achieve a mutation-drift equilibrium, a historical population consisted of 1000 unrelated animals (500 males and 500 females) were simulated 100 generation and then in order to create the linkage disequilibrium (LD), its size was gradually decreased from 1000 to 500 individuals in generation 200.
In the next step, individual of last historical generation (considered as founders) were mated randomly for 10 more generations, assuming five offsprings per dam and an exponential growth of the number of dams.
Finally, 50 males and 500 females from the last generation of the expanded population were randomly mated to generate another 10 generations.Individuals of generation 211 were considered as the training set and the individuals born at either generations of 213, 215 or 217 were regarded as validation sets.
The simulated genome consisted of one chromosome of 100 cM length.Aiming to investigate the effect of SNP density on prediction accuracy, 500, 750 or 1000 bi-allelic markers equally spaced across the genome were considered.Additive genetic effects were determined by 50 quantitative trait loci (QTL) randomly distributed through the genome.QTL effects were generated based on a normal distribution.The mutation rate of the markers and QTLs was assumed 2.5 × 10 -5 per locus per generation (Solberg, Sonesson, & Woolliams, 2008).
A trait with heritability of 0.15, 0.30 and 0.45 was simulated.The true breeding value (TBV) of each individual was the sum of the QTL allele substitution effects and polygenic effects, assuming only additive effects.Phenotypes were generated by adding residuals, randomly drawn from a normal distribution with mean equal to zero, to the TBVs.The correlation between the BV and the genomic predicted BV (r TBV , GEBV ) was used as a measure of the accuracy of GEBV prediction.

Linkage disequilibrium calculation
The extent of LD in the training populations was measured by r 2 : where freq (A1) is the frequency of the A1 allele in the population, and likewise for the other alleles in the population and D is another statistic of linkage disequilibrium that calculated as: D = freq (A1_B1) * freq(A2_B2) -freq (A1_B2) * freq (A2_B1).PLINK software (Purcell et al., 2007) was used to calculate the LD.

Statistical analysis
The general structure for the models in linear form is: where y is the vector of phenotypic records, μ is the overall mean, g j is the coefficient of marker j denoting the allele substitution effect, X j is a design matrix of genotype codes for the respective marker, and e is a vector of residuals.The data were analyzed using three different approaches: two Bayesian methods (BayesA and Bayesian LASSO) and one kernel based semi parametric method (RKHS), which was introduced by Gianola et al. (2006) for genomic evaluation.The BGLR (Bayesian Generalized linear regression) package of R software (Pérez & Campos, 2014) was used to genomic evaluations.
BayesA Meuwissen et al. (2001) proposed two hierarchical Bayesian models for GS denoted by BayesA and BayesB.In both methods data and variances of the marker positions need to be modeled.Inferences about model parameters are based on the posterior distribution.The BayesA approach applies the same prior distribution for all variances of the marker positions.
Reproducing kernel Hilbert spaces regression Gianola et al. (2006) proposed a semi parametric method in the genomic evaluations as an alternative to SNPs regressions.The hope was that these methods are capable for capturing complex interaction patterns that may be difficult to account for in a linear model framework.
In this study, each scenario repeated 10 times and estimation of marker effects was performed using the R package, BGLR (Pérez & Campos, 2014).In order to investigate effects of heritability, marker effect estimation method, number of markers and the interval between training and validating sets and their interactions on genomic prediction accuracy, the PROC GLM of SAS software was used (SAS, 2004).

Linkage disequilibrium
The r 2 values for different marker densities were presented in Table 1.As expected the r 2 value increased with increment of SNP density and highest value (0.13) was for SNP=1000.
In genome-wide association studies (GWAS) and genomic prediction studies; a minimum value of LD is needed to obtain accurate results.For instance, an average r 2 of 0.2 between markers and QTL is essential to detect QTL of moderate effect.The extent of genome-wide LD is largely determined by marker density and the past effective population size.The marker-QTL LD phase breaks over generations due to recombination in the meiosis process in each generation.Therefore, the markers effect must reestimate at least every three generations (Hayes, 2007).

Prediction accuracy
The results of analysis of variance of accuracy were presented in Table 2. Main factors including heritability, number of markers, generation of validating set, and also interaction of h2 × Number of markers, h 2 × generation of validating set and Number of markers× generation of validating set, affected the accuracy significantly (p < 0.05).Among investigated main factors, heritability and number of markers had the highest effect on accuracy.However, marker density is a factor easily controlled by researcher but heritability is a factor relating to trait structure and could not easily controlled by researcher.
Table 3 shows the accuracy of all investigated scenarios using three methods.Among three investigated methods, there were no differences in terms of accuracy but the highest mean accuracy r TBV , GEBV =0.659 was obtained in BayesA method (Figure 1a), while the computational demands of two other methods were higher.In RKHS method, choosing the reproducing kernel and bandwidth value affected the results.In this study, the Gaussian reproducing kernel with the Euclidean distance between each pair of input vectors was chosen and the model using arbitrarily chosen bandwidth value was fitted.Gianola et al. (2006) reported that in comparison with RKHS and multiple linear regression (MLR) mixed model, when gene action was additive, these two methods had the same accuracy but when gene action was non additive (additive by additive interaction), the parametric MLR was clearly outperformed by RKHS.In a simulation study, the accuracies of BayesA and BL were the same and higher than RKHS (Howard, Carriquiry, & Beavis, 2014).
Table 3 and Figure 1 (a-d).Accuracy of different investigated scenarios (For all cases Standard Error < 0.04).
With increasing the heritability, the correlation between TBV and GEBV was increased and the amount of excess was higher when departure was from 0.15 to 0.30 rather from 0.3 to .45 (Table 3 and Figure 1b).The high heritability equals with high contribution of gene effects in phenotypic variation and therefore the high accuracy.Many researchers have confirmed the high accuracy achieved for high heritabilities (Daetwyler, Villanueva, & Woolliams, 2008;Goddard, 2009;Howard et al., 2014).For high heritability traits (such as carcass characteristics in meat animals), the gene effects have great contribution in phenotypic variation and therefore provide more accurate evaluations.Therefore, performing selection would easily lead to progress in this kind of traits.
The marker density is the most easily controllable factor by the researcher.The higher accuracy of genomic breeding values was associated to more dense markers (Figure 1c).Doubling the marker density, the accuracy increased from 0.59 to 0.71.The more marker density was led to more coverage of the genome and high LD which in turn resulted to better estimation of QTL effects.In a simulation study, doubling the number of SNP was led to increase the accuracy from 0.63 to 0.73 (Piyasatian & Dekkers, 2013) which is in agreement with our results.the accuracy of GEBV due to increment of marker study was reported in many studies (Combs & Bernardo, 2013;Piyasatian & Dekkers, 2013;Solberg et al., 2008) Highest accuracy was observed when the interval between training and validating sets was as short as two generations (generation of validating set=3 th ).As time interval between training and validating sets prolonged, the accuracy of GEBV decreased.In a simulation study, accuracies were compared in the cases that validating set belonged to 1 th and 7 th generations and the accuracies were 0.73 and 0.60, respectively (Piyasatian & Dekkers, 2013).Accuracies decreased across generations because of the decay of genetic relationships that can captured by markers and changing of LD phase.In some researches, decreasing of genomic prediction accuracies over time due to breakdown of LD phase was reported (De Roos, Hayes, Spelman, & Goddard, 2008;Meuwissen et al., 2001).The recombination events break the LD phase during provision of gametes in each generation (Habier et al., 2009).In order to perform accurate genomic evaluations, marker effects should be re-evaluated at least every three generations (Hayes, 2007).
As mentioned above, the genomic accuracy increased along with increase in marker density, but the increment was different among three levels of heritability.At the low heritability (0.15), as the number of SNPs were doubled the accuracy increased 19 unites but at the high heritability (0.45), the excess only was 3 unites (Table 4).This suggested that dense SNP panels are more efficient in terms of accuracy, for traits with lower heritability.
The accuracy of genomic prediction decayed due to recombination effect across generations that change the LD phase between QTLs and markers.
The greater decline of accuracy was observed when the marker density was 500 and the decay of accuracy was decreased when the denser marker panels were used (Table 6).These results showed that using dense marker panels, the estimated marker effects of validating set could apply for longer time and it decreased the costs of reestimating of marker effects.Habier et al. (2007) reported that accuracy of genomic prediction decreased across generations but opposite to our results the amount of decay depended on marker estimation method; therefore, the lowest and highest declines were for TP-BLUP and Bayes-B2, respectively.As Table 5 shows, when heritability was 0.15, the accuracy of genomic evaluation for generation 7 of training set was 11 percent lower than the value for generation 3. The corresponding reductions for moderate and high heritabilities were 6 and 5 percent, respectively.As getting away from validating set, the accuracies decreased rapidly for low heritability case, suggesting that re-evaluating of marker effects is more imperative in the low heritability traits.

Conclusion
The results of this study showed that for models with additive gene action RKHS method did not perform better than parametric methods such as BayesA and BL, although the RKHS is more complicated and time consuming.Comparison of Acta Scientiarum.Animal Sciences Maringá, v. 38, 4, p. 447-453, Oct-Dec., 2016 these methods for non-additive models should be performed under different simulation and real data.Marker density is one of the most important factor that affect the genomic prediction accuracy and fortunately by new progresses in genotyping technologies, the high density SNP panels with low cost is available and could employ easily for getting accurate genomic prediction.Preventing decay of accuracy due to recombination across time was one of most important benefits of dense marker panels so when the highest number of markers (1000) were used, the lowest accuracy decay was happened.In this study, the decreasing trend of accuracy across generations was not affected by marker effect estimation methods.In high heritability traits, increasing the number of markers had slight effect on accuracy but for low heritability trait, increasing number of markers increased accuracy severely; therefore, using the dense marker panels is imperative for low heritability traits.There was the same association between heritability and the interval between validating and testing sets, so that getting away from validating sets decreased the accuracy of high heritability trait slightly but the decline was severe for low heritability trait.

Table 1 .
The r 2 values for different marker numbers.

Table 2 .
Output of Analysis of variance for accuracy.
H2= heritability, N_marker= number of marker, Interval= the interval between training and validating sets.

Table 4 .
The estimated genomic accuracy for three marker density and three levels of heritability.

Table 5 .
The estimated genomic accuracy for three validating sets and three levels of heritability.

Table 6 .
The genomic accuracy for three validating sets and three levels of marker density.