Self-organizing maps in the study of genetic diversity among irrigated rice genotypes

This study presents self-organizing maps (SOM) as an alternative method to evaluate genetic diversity in plant breeding programs. Twenty-five genotypes were evaluated in two environments for 11 phenotypic traits. The genotypes were clustered with the SOM technique, varying the topology and the number of neurons. In addition to the SOM analysis, unweighted pair group method with arithmetic mean (UPGMA) clustering was performed to compare the clustering produced by the two techniques and to evaluate their complementarity. Genotype ordering according to SOM was consistent with the UPGMA results: the basic structure of the UPGMA groups was preserved in each group of the maps. Regarding genotype arrangement and group neighborhoods, maps with five neurons organized the genotypes less efficiently than the six-neuron arrangements in both environments. In all scenarios, the organization pattern among the rice genotypes revealed by the maps was complementary to the UPGMA approach. It can be concluded that self-organizing maps have the potential to be useful for genetic diversity studies in breeding programs.


Introduction
Rice (Oryza sativa L.) is one of the most important crops in the world and is considered one of the main annual crops in Brazil. With the increase in population, demand has increased throughout the years, and it is estimated that by 2050 global rice production must increase by 60 to 110% to meet demand (Godfray et al., 2010; Tilman, Balzer, Hill, & Befort, 2011; Ray, Mueller, West, & Foley, 2013). However, this will only be possible as long as genetic variability is maintained. In Brazilian irrigated rice breeding programs, genotypic variability is restricted (Rabelo, Guimarães, Pinheiro, & Silva, 2015; Streck, Aguiar, Magalhães Júnior, Facchinello, & Oliveira, 2017); therefore, investigating the genetic diversity of rice genotypes is critical.
The rice cultivar indication process for commercial plantations is dynamic, and new cultivars are periodically recommended as substitutes for those that are less productive or have less commercial acceptance (Soares et al., 2008). New cultivar development is crucial to help increase food availability, and the success of breeding programs relies on the existence of genetic variability. Breeders have recommended forming a base population by intercrossing superior and genetically divergent cultivars. This is essential for the success of breeding programs (Cruz, Ferreira, & Pessoni, 2011).
One of the first steps in the formation of a base population is to guarantee genotypic variability through the morphological, physiological and molecular differences of the parents, generally expressed by a dissimilarity measure (Cruz, 2012). Genetic distance estimates depend on the data set available, as well as on its phenotypic, genotypic, molecular or geographic features (Cruz et al., 2011).
Multivariate techniques such as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), the Tocher method, Principal Components Analysis (PCA) and Canonical Variables (CV) have been used for the simultaneous comparison of qualitative and quantitative traits, resulting in more precise distance estimates and more accurate genetic diversity predictions among genotypes (Barbosa, Viana, Quintal, & Pereira, 2011; Preisigke et al., 2015). Another promising approach for genetic diversity studies is the self-organizing maps (SOM) method, a computational intelligence technique that allows for the visualization of similar patterns and data classification based on the distances between them (Kohonen, 2014).
SOM is a type of two-dimensional artificial neural network that organizes data through an unsupervised learning process and preserves notions of neighborhood using Euclidean distance. Learning begins with the initialization of synaptic weights. A competition process then starts, in which each data sample is allocated to the neuron that best represents it; this neuron is called the "winner". Next comes cooperation, in which the winning neuron determines how strongly the other neurons are drawn toward the sample, in order of their proximity to the winner. Finally, the neurons within this neighborhood go through the adaptation phase, in which their weights are adjusted. After all iterations, the map is organized in a topological structure that reflects the proximity of the elements under study.
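The competition, cooperation and adaptation steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study (which relied on Matlab and GENES): the function names `train_som` and `best_matching_unit`, the Gaussian neighborhood, the linear decay of the learning rate and radius, the rectangular grid and the toy data are all assumptions made for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, x):
    """Competition: index of the neuron whose weights are closest to x."""
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k], x))


def train_som(samples, cols, rows, n_iter=1000, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small SOM on a rectangular grid (hypothetical settings)."""
    rng = random.Random(seed)
    # Grid position of each neuron on the map.
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    # Initial synaptic weights: copies of randomly chosen samples.
    weights = [list(rng.choice(samples)) for _ in coords]
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 1e-3   # neighborhood radius shrinks
        x = rng.choice(samples)
        win = best_matching_unit(weights, x)   # competition
        wr, wc = coords[win]
        for k, (r, c) in enumerate(coords):
            # Cooperation: influence falls off with grid distance to the winner.
            grid_d2 = (r - wr) ** 2 + (c - wc) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Adaptation: move the neuron's weights toward the sample.
            weights[k] = [w + lr * h * (xi - w) for w, xi in zip(weights[k], x)]
    return weights


# Toy data (not from the study): six "genotypes" described by two
# already-standardized traits.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
weights = train_som(samples, cols=2, rows=3)
groups = [best_matching_unit(weights, s) for s in samples]
```

Here `groups` holds, for each sample, the index of the neuron to which it was allocated, which plays the role of the group label reported for each genotype on a map.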
SOM can present hexagonal topology, where each neuron has at most six direct neighbors, or quadratic (grid) topology, with at most four direct neighbors. In addition, different arrangements are established that define the number of neurons available on the map. For example, a map with a two by three arrangement presents six neurons arranged in two columns and three rows. This technique is widely used in various branches of science such as engineering (Akkiraju, Keskinocak, Murthy, & Wu, 2001), industry (Liukkonen, Laakso, & Hiltunen, 2013) and economics (Louis, Seret, & Baesens, 2013; Sarlin, 2013). Although self-organizing maps are widely used, this methodology is still relatively under-explored in plant breeding.
This work aims to test and present the self-organizing maps technique as an alternative method to evaluate genetic diversity in plant breeding programs.

Material and methods
Twenty-five genotypes were evaluated (Table 1) from the irrigated rice breeding program of the Empresa de Pesquisa Agropecuária de Minas Gerais (EPAMIG), in partnership with Embrapa Arroz e Feijão, of which five were checks (Rio Grande, Ourominas, Seleta, Predileta, and Rubelita). Experiments were carried out in lowland soils under continuous flooding conditions in a randomized block design with three replications in the 2012/2013 harvest, in two municipalities of the state of Minas Gerais: Leopoldina (21°31'12"S, 42°38'43"W) and Lambari (21°58'32"S, 45°21'01"W). The experimental plots consisted of five-meter plant rows with 0.30 m row spacing, with a total plot area of 7.5 m² and a useful area of 3.60 m². The sowing density was 300 seeds m⁻². The agronomic traits evaluated were grain yield (kg/ha), plant height (cm), flowering (days), 100-grain weight (g), grain size (length, width and thickness), length/width ratio, panicle length, number of full grains/panicle, full grains percentage and number of stems/m². All cultural practices were carried out as recommended for the crop (Borém & Nakano, 2015).
Joint analysis of variance was performed for each trait, considering the effects of genotypes to be fixed and environments to be random, according to Equation 1: (1)
For the genetic diversity study, dissimilarity matrices for each environment were obtained based on the average Euclidean distance. The clustering method under the conventional statistical approach was the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). The Mojena (1977) criterion was used to define the optimal number of dendrogram groups, adopting k = 1.25. To assess cluster consistency and quality, three measures were obtained: the cophenetic correlation coefficient (CCC), the distortion between the dissimilarity matrix and the matrix obtained from the dendrogram (graphical matrix), and the stress (the precision of the fit obtained when the dissimilarity matrix is projected onto the dendrogram). The CCC was given by the correlation between the elements of the dissimilarity matrix and the elements of the matrix produced by the phenogram (cophenetic matrix) (Silva & Dias, 2013).
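As an illustration of the UPGMA workflow described above, the following Python sketch computes the average Euclidean distance, builds the UPGMA merges, derives the cophenetic correlation coefficient and applies the Mojena cut-off, taken here in its usual form (mean of the fusion levels plus k times their standard deviation). It is a simplified sketch under stated assumptions, not the GENES implementation: all function names are hypothetical, the toy data are not from the study, and distortion and stress are not computed.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def mean_euclidean(a, b):
    """Average Euclidean distance between two trait vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def upgma(D):
    """Return the list of merges (cluster_a, cluster_b, fusion_level)."""
    clusters = [(i,) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # UPGMA: mean of all pairwise distances between the two clusters.
            h = mean(D[i][j] for i in a for j in b)
            if best is None or h < best[0]:
                best = (h, a, b)
        h, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, h))
    return merges


def cophenetic_correlation(D, merges):
    """Pearson correlation between original and cophenetic distances."""
    n = len(D)
    C = [[0.0] * n for _ in range(n)]
    for a, b, h in merges:
        for i in a:
            for j in b:
                C[i][j] = C[j][i] = h
    x = [D[i][j] for i, j in combinations(range(n), 2)]
    y = [C[i][j] for i, j in combinations(range(n), 2)]
    mx, my = mean(x), mean(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def mojena_groups(merges, n, k=1.25):
    """Cut the dendrogram at mean + k * sd of the fusion levels."""
    heights = [h for _, _, h in merges]
    theta = mean(heights) + k * stdev(heights)
    groups = {(i,) for i in range(n)}
    for a, b, h in merges:
        if h <= theta:          # keep only fusions below the cut-off
            groups -= {a, b}
            groups.add(a + b)
    return sorted(sorted(g) for g in groups)


# Toy data (not from the study): six genotypes, two standardized traits.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
D = [[mean_euclidean(a, b) for b in data] for a in data]
merges = upgma(D)
ccc = cophenetic_correlation(D, merges)
groups = mojena_groups(merges, len(data), k=1.25)
```

For these well-separated toy points, `groups` contains the two obvious clusters and `ccc` is close to 1. In practice a library routine for average-linkage clustering would normally replace the hand-written loop.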
The genotypes were also clustered using self-organizing maps, an unsupervised machine learning technique. The replication averages of each genotype for all 11 variables in each assay were used as inputs for this approach. No outputs were stipulated a priori for each genotype, because this is an unsupervised technique. To evaluate the SOM consistency and the best configuration for clustering, eight scenarios were established that varied in the number of neurons, their arrangement and the topology used. The eight scenarios were tested in each experiment, for a total of 16 analyses.
Five or six neurons were initially adopted as possible centroids of the genotype groups to be formed. Because this technique is influenced both by the number of neurons adopted a priori and by the arrangement of these neurons, this study also evaluated whether any arrangement allowed better visualization of the diversity among the genotypes. The possible arrangements depended on the number of neurons and on whether that number was prime. Thus, in scenarios with a prime number of neurons, such as the arrangements containing five neurons, only one by five or the inverse could be used. With six neurons, a larger number of arrangements was possible; however, only two by three or the inverse were adopted. The other possible arrangements were not used because they resembled the arrangements with five neurons.
Although Kohonen (2014) affirms that the hexagonal topology allows for better visualization of the general data structure, in addition to minimizing errors, it is not yet known whether one topology is more adequate for genetic diversity studies. Therefore, in this study we tested both the grid and the hexagonal topologies.
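The scenario design can be checked with a short enumeration. The sketch below is illustrative only: the variable names are assumptions, arrangements are written as (columns, rows), and the scenario numbering used in the tables is not reproduced because the text does not state how the numbers map to configurations.

```python
from itertools import product

# Arrangements as (columns, rows): five neurons in a single line,
# six neurons as two by three or three by two.
arrangements = [(1, 5), (5, 1), (2, 3), (3, 2)]
topologies = ["grid", "hexagonal"]
environments = ["Leopoldina", "Lambari"]

# Every arrangement is combined with every topology.
scenarios = list(product(arrangements, topologies))
# Every scenario is run once per environment.
analyses = list(product(environments, scenarios))

n_neurons = sorted({cols * rows for (cols, rows), _ in scenarios})
```

This gives eight scenarios and 16 analyses, with maps of five or six neurons, matching the counts stated above.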
For SOM processing in the different scenarios, the standardized average Euclidean distance was used. Each scenario was run for 1000 iterations. The Matlab (Matlab, 2010) and GENES (Cruz, 2012) software packages were used to perform the analyses.
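For reference, the standardized average Euclidean distance is commonly written as below. The notation is an assumption, since the text does not define symbols: x_ij is the mean of genotype i for trait j, x̄_j and s_j are the mean and standard deviation of trait j, and n is the number of traits.

```latex
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\qquad
d_{ii'} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( z_{ij} - z_{i'j} \right)^{2}}
```

Centering by the trait mean does not change the distance, because it cancels in the difference between two genotypes; only the scaling by s_j matters.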

Results and discussion
The F test revealed significant effects for genotypes (p < 0.05) for the following traits: number of full grains per plant, full grains percentage, grain width and grain length/width ratio (Table 2). The coefficients of variation estimated for most traits were compatible with those obtained in other studies of rice (Silva, Silva, Guimarães, & Moura, 2011; Hosan, Sultana, Iftekharuddaula, Ahmed, & Mia, 2010; Streck et al., 2017), indicating acceptable experimental quality.
The genotype x environment (G x E) interaction effect was significant for grain yield, panicle length, grain length, grain width, and grain thickness, in addition to the grain length/width ratio, indicating that the genotypes behave differently across the evaluated environments for these traits. Therefore, the genetic diversity among the genotypes was studied separately for each environment, because the clustering pattern may vary with the environment.
For the municipality of Leopoldina (Figure 1A), the UPGMA clustering presented a CCC of 0.7306, with a distortion of 2.53% and a stress of 15.90%. For Lambari (Figure 1B), the CCC, distortion and stress values were 0.7025, 4.02% and 20.07%, respectively, showing adequate fit between the dissimilarity matrices and the dendrograms. According to the global criterion introduced by Mojena (1977), taking the last fusion level as reference, five different groups were formed at the 80% level of similarity in both dendrograms. Although the number of groups was the same, the clustering method gathered the genotypes differently in each environment, a result expected due to the significant G x E interaction for most evaluated traits. Despite this, the genotypes BRA 041099, MGI 0712-1, MGI 0901-5, MGI 0713-17, RUBELITA, MGI 0607-1, BRA 02706, BRA 02708, MGI 0517-25, OUROMINAS, and RIO GRANDE remained together in a single group, independent of the environment (Figure 1A and B), reinforcing the idea of greater similarity among them.
In addition to cluster analysis under the conventional biometric approach, a spatial ordering of genotypes was also performed through self-organizing maps, which use computational intelligence to search for an optimal solution (Kohonen, 1990). Smith and Ng (2003) affirm that it is difficult to quantify the efficiency of clustering made from SOMs, but they were able to generate clearly distinguishable groups.
Considering the experiment conducted in Leopoldina, the maps with five neurons (scenarios one, two, three and four) showed genotypic organization with high agreement among groups (Table 3). With this configuration, scenarios one, three and four used the same organization pattern for all genotypes. Only clusters one, two and three, belonging to scenario two, diverged in relation to the other scenarios with quadratic topologies. However, even with this variation, there was some agreement between these groups, because in all scenarios the genotype pairs 6 and 18, and 8 and 13, for example, remained in the same group. The genotypes that diverged most in terms of classification in all the maps for this environment were genotypes 16, 19, and 24.
In Lambari, map results for the scenarios with five neurons were similar (Table 4). In this case, genotypes 16, 21, and 22 were those that diverged regarding the allocated groups. In general, the maps with the one by five and five by one arrangements presented similar classification results, mainly because they are very simple and the topology probably does not interfere in these configurations. In scenarios five, six, seven and eight, the map results diverged little among themselves, and in scenarios six, seven and eight the genotypes were allocated to identical groups. Only groups two and three from scenario five differed from the other scenarios, with emphasis on genotypes 21 and 23, which had a distinct allocation pattern in other scenarios with five neurons. This result highlights the possibility of anomalies that occur because the process is iterative and because these genotypes are related and genetically very similar.
Although different clustering techniques provide different views of diversity, agreement among them is expected. Therefore, the results obtained by the self-organizing maps were compared to those obtained by the conventional statistical approach, with the main goal of observing the clustering behavior associated with these techniques and evaluating their complementarity.
The genotypes were identified in each map according to the UPGMA clustering. Genotype ordering according to the SOM technique was consistent with the hierarchical clustering results, because the basic structure of the UPGMA groups was preserved in each group of the maps (Figures 2 and 3). Considering genotype arrangements and group neighborhoods, maps with five neurons organized the genotypes less efficiently than the six-neuron arrangements in both environments. This can be observed in Figure 2, where genotypes 8, 13 and 16 were separated into non-neighboring groups in the scenarios with five neurons, but remained in neighboring groups in all scenarios with six neurons.
Acta Scientiarum. Agronomy, v. 41, e39803, 2019
Figure 2. Arrangement of the clustering scenarios according to the SOM for Leopoldina. Genotypes belonging to the same group according to the UPGMA method were identified on the maps using the same color (i.e., numbers in blue, green, red, black, and purple each represent a specific UPGMA cluster). Numbers in parentheses refer to the respective scenarios.
In scenarios with five neurons, each group has at most two direct neighbors, while in the six-neuron arrangements each group can have up to three direct neighbors; therefore, with six neurons the technique is better able to capture the proximity among groups and to organize the genotypes more efficiently in each one according to the grouping carried out. In addition, Figure 3 shows that genotypes 1 and 9 in scenarios two, three, four, six and eight were distinctly allocated to the groups consisting of their peers according to UPGMA clustering. According to Kohonen (2014), hexagonal topology allows for better neuron arrangements; moreover, simpler configurations like those used with five neurons may distort genotype organization, especially in cases where the material studied is not easily distinguishable.
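The neighbor counts quoted above can be verified for the grid (quadratic) topology with a few lines of Python. This is a sketch only: the function name is hypothetical, arrangements are given as (columns, rows), and hexagonal maps are not covered because their neighbor counts depend on how the rows are offset.

```python
def max_direct_neighbors(cols, rows):
    """Largest number of direct (up/down/left/right) neighbors on a grid map."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    best = 0
    for r in range(rows):
        for c in range(cols):
            # Count the adjacent positions that fall inside the map.
            n = sum(0 <= r + dr < rows and 0 <= c + dc < cols
                    for dr, dc in steps)
            best = max(best, n)
    return best


five_neuron_maps = [max_direct_neighbors(1, 5), max_direct_neighbors(5, 1)]
six_neuron_maps = [max_direct_neighbors(2, 3), max_direct_neighbors(3, 2)]
```

Both five-neuron maps give a maximum of two direct neighbors and both six-neuron maps give three, which is why the six-neuron arrangements offer more connections among groups.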
Figure 3. Arrangement of the clustering scenarios according to the SOM for Lambari. Genotypes belonging to the same group according to the UPGMA method were identified on the maps using the same color (i.e., numbers in blue, green, red, black, and purple each represent a specific UPGMA cluster). Numbers in parentheses refer to the respective scenarios.
Because the number of groups is fixed by the number of neurons predefined in the SOM configuration, some abnormalities can arise, such as the separation of a large group into two or more smaller subgroups, or the union of groups. In addition, it should be noted that the rice genotypes evaluated in these assays are in the final stage of breeding and are genetically close, which may make them difficult to allocate. However, in general, the organization pattern among the rice genotypes evaluated by the maps is complementary to the UPGMA approach, as observed in all scenarios.
When evaluating the map arrangement in different scenarios, it was noted in Leopoldina that a three by two arrangement with hexagonal topology preserved the organization of the UPGMA groups, a fact that was not observed in scenario five (two by three arrangement), where genotypes 23 and 24 were allocated to distant groups (Figure 2). In Lambari, the three by two and two by three arrangements with the same topology were superior to the others because they permitted more connections among groups. These topologies were able to organize the genotypes into groups closer to those obtained by UPGMA clustering (Figure 3). In particular, genotypes of the same UPGMA group remained in the same group or in linked groups in the SOM. However, with higher numbers of neurons there is a possibility of greater variation in the allocation of genotypes, as affirmed by Kohonen (2014).
The SOM method has been shown to be an efficient way of identifying patterns of similarity, as shown by Mwasiagi (2011), who used the SOM technique to distinguish cotton genotypes. This author concluded that the method was efficient for separating thin wires from coarser ones, and that samples dispersed on the map would be outliers, implying irregularity of the material. Smith and Ng (2003), studying the efficiency of SOM for organizing web pages through navigation patterns, obtained a satisfactory result and concluded that the method can be easily incorporated; however, it needs further development for large-scale applications. A similar conclusion was reached by Fritzke (1994), who studied map efficiency for supervised and unsupervised learning.
In addition to the strong agreement obtained by the SOM, this technique presented high complementarity to the conventional statistical approach. Thus, the organization of genetic diversity through self-organizing maps is efficient, and the SOM technique has the potential to be useful for genetic diversity studies in breeding programs.

Conclusion
Self-organizing maps have the potential to be useful for genetic diversity studies in breeding programs.

Figure 1. Dendrogram obtained from the dissimilarity matrix of Euclidean distances among 25 rice genotypes, using Unweighted Pair Group Method with Arithmetic Mean (UPGMA) clustering, for the municipalities of Leopoldina (A) and Lambari (B).

Table 1. Origin and coding of the rice genotypes.
ns, * and **: not significant, significant according to the F-test at the 0.05 and 0.01 probability levels, respectively; SV: sources of variation; CV: coefficient of variation; DF: degrees of freedom.

Table 3. Clustering of 25 rice genotypes in the municipality of Leopoldina according to self-organizing maps.

Table 4. Clustering of 25 rice genotypes in the municipality of Lambari according to self-organizing maps.