Computational intelligence to study the importance of characteristics in flood-irrigated rice

ABSTRACT. The study of traits in crops enables breeders to guide selection strategies and accelerate the progress of genetic breeding. Although the simultaneous evaluation of characteristics in a plant breeding programme provides a large quantity of information, identifying which phenotypic characteristic is most important is a challenge facing breeders. Thus, this work aims to quantify the best approaches for prediction and to establish the network with the best predictive power in flood-irrigated rice via methodologies based on regression, artificial intelligence, and machine learning. Multiple regression, computational intelligence, and machine learning were used to predict the importance of the characteristics. Computational intelligence and machine learning were notable for their ability to extract nonlinear information from the model inputs. Predicting the relative contribution of auxiliary characteristics through computational intelligence and machine learning proved efficient in determining the relative importance of variables in flood-irrigated rice. The characteristics indicated to assist in decision making in this study were flowering, number of filled grains per panicle, and panicle length. A network with a single hidden layer of 15 neurons was efficient in determining the relative importance of variables in flooded rice.


Introduction
Plant breeding is effective in increasing the productivity of crops. The primary objective of plant breeding is to increase the frequency of good alleles in plant populations such that superior crops are developed with high productivity, resistance to diseases and pests, tolerance to abiotic stresses, and superior adaptation to environments (Yu, Campbell, Zhang, Walia, & Morota, 2019).
In general, productivity prediction is performed using multiple linear regression. Although interesting, multiple regression models have some limitations, such as sensitivity to the size of the sample. Specifically, when the number of observations is smaller than the number of parameters, it is not possible to obtain estimates using the usual estimation methods. Additionally, such models do not allow the adjustment of complex nonlinear relationships that may exist in some data sets. Artificial neural networks (ANNs) provide an interesting alternative because they can capture nonlinear relationships between predictors and responses (Gianola, Okut, Weigel, & Rosa, 2011; Skawsang, Nagai, Nitin, & Soni, 2019) without relying on distributional assumptions about the data.
The application of artificial intelligence, such as ANNs, allows nonlinear effects in the data set to be captured and has been used in prediction studies in plant breeding (Silva et al., 2014; Silva et al., 2017; Sant'anna et al., 2019). However, although ANNs are powerful predictive tools compared to conventional models such as multiple linear regression (Paruelo & Tomasel, 1997; Olden & Jackson, 2002; Beck, 2018), they do not, by themselves, quantify the importance of the input variables.
Quantifying the importance of variables for prediction in breeding programmes allows faster progress by selecting and predicting characteristics that have low heritability and/or are difficult to measure. Although the simultaneous evaluation of characteristics provides a wide variety of information, identifying which predictor variable is most important is a challenge for breeders (Parmley, Higgins, Ganapathysubramanian, Sarkar, & Singh, 2019). The importance of variables can be quantified in ANNs through algorithms such as that of Goh (1995), who proposed a modification of Garson's (1991) algorithm that partitions the neural network connection weights to determine the relative importance of each variable entering the network.
Other interesting alternatives for studying the prediction and importance of variables are methodologies based on machine learning, such as decision trees (Beucher, Møller, & Greve, 2019; Parmley et al., 2019) and their refinements, such as bagging (Degenhardt, Seifert, & Szymczak, 2019), random forest, and boosting (Degenhardt et al., 2019). Such methodologies allow good predictions, and the importance of the characteristics can be obtained through measures based, for example, on the Gini index and entropy (Hastie, Tibshirani, & Friedman, 2009). These methodologies enable quantification of the impact of disrupting or disturbing the input information on the estimate of the coefficient of determination.
Methodologies based on regression, artificial intelligence, and machine learning have been used successfully in prediction studies. Parmley et al. (2019) evaluated high-dimensional phenotypic characteristics of soybean through a machine learning approach to predict seed yield for the prescriptive development of cultivars for agricultural practices. Skawsang et al. (2019) applied such methodologies to predict insect pest populations using climatic and phenological factors of the host plant. However, there are no studies in the literature related to yield prediction and verification of the importance of variables for grain yield in rice. Unlike regression methods, artificial intelligence and machine learning make no prior assumptions about the data structure and capture linear and nonlinear dependencies between the predictors and the response variables, making them suitable tools for the researcher.
Given the above, this work aims to i) predict grain yield, grain length-to-width ratio, and panicle length in flood-irrigated rice through regression, artificial intelligence, and machine learning methodologies; ii) quantify the best approaches to prediction; and iii) establish the network with the best predictive power in flood-irrigated rice.
The evaluated characteristics were grain yield (GY, kg ha⁻¹), panicle length (PL, cm), and grain length-to-width ratio (LGW), which were used as response variables; the others were used as explanatory variables (inputs): plant height (HP, cm), flowering (FL, days), lodging (LO), number of filled grains per panicle (GP), percentage of filled grains (FG, %), tillering (TI), grain length (GL, mm), grain width (GW, mm), grain thickness (GT, mm), and weight of 100 grains (WG, g). These data were used to compose artificial neural networks for genotypes of flood-irrigated rice in the State of Minas Gerais.

Multiple regression
Multiple regression, through the stepwise strategy (Ghani & Ahmad, 2010), was used to predict the response variables grain yield, panicle length, and grain length-to-width ratio as a function of the other measured variables, which were considered explanatory. The adopted model is represented by Equation 1:

Y = β₀ + β₁X₁ + ... + βₚXₚ + ε (1)

where Y is the response variable (grain yield, panicle length, or grain length-to-width ratio); X₁, ..., Xₚ are the explanatory variables; β₀ is the intercept; β₁, ..., βₚ are the linear coefficients associated with X₁, ..., Xₚ; and ε is the residual effect. The estimate of the coefficient of determination (R²) was used to verify how much of the variation in the dependent variable is explained by the explanatory variables.
The description of R² is given in Equation 2:

R² = 1 − Σᵢ(Yᵢ − Ŷᵢ)² / Σᵢ(Yᵢ − Ȳ)² (2)

where Yᵢ is the observed value, Ŷᵢ is the predicted value, and Ȳ is the mean of the observed values.
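As an illustration, the regression fit and the R² of Equation 2 can be reproduced with ordinary least squares. The sketch below uses simulated data (not the study's phenotypes) and Python's scikit-learn as a stand-in for the R/MATLAB tools used in the study; the variable names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Simulated stand-ins for four explanatory traits (e.g. HP, FL, GP, TI)
X = rng.normal(size=(60, 4))
# Simulated response (e.g. GY), depending on two of the inputs plus noise
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.3, size=60)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# R^2 computed exactly as in Equation 2: 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

The hand-computed R² matches the value reported by the fitted model, confirming that Equation 2 is the quantity the software optimizes and reports.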

Artificial intelligence
For better network efficiency, the data were normalized to the range between -1 and 1 before training and validation. The training data set, in each location, comprised 2/3 of the phenotypic information, using the strategy of aggregating information from two of the three repetitions for training and using the information from the remaining repetition as a validation set. In this cross-validation strategy, individuals from each repetition participated at least once in the validation data set, in a k-fold scheme with k = 3 partitions.
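The normalization and the repetition-based 3-fold split described above can be sketched as follows; the data here are simulated and the layout (30 genotypes × 3 repetitions) is a hypothetical example, not the study's data set:

```python
import numpy as np

def scale_minus1_1(x):
    """Rescale each column to the range [-1, 1], as done before network training."""
    xmin, xmax = x.min(axis=0), x.max(axis=0)
    return 2.0 * (x - xmin) / (xmax - xmin) - 1.0

rng = np.random.default_rng(1)
# Hypothetical phenotypes: 30 genotypes x 3 repetitions (rows), 5 traits (columns)
reps = np.repeat([0, 1, 2], 30)          # repetition label of each row
X = rng.normal(size=(90, 5)) * 10 + 50

Xs = scale_minus1_1(X)

# k = 3 cross-validation: each repetition is the validation set once,
# and the other two repetitions form the training set (2/3 of the data)
folds = []
for k in range(3):
    train = Xs[reps != k]
    valid = Xs[reps == k]
    folds.append((train, valid))
```

Keeping whole repetitions together in each fold ensures that every individual appears in the validation set exactly once, as in the strategy described above.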

Multilayer Perceptron (PMC)
The maximum number of training epochs was set at 5,000, and a mean squared error (MSE) of 1.0 × 10⁻³ was defined as the criterion to stop network training. All trained networks had one neuron in the output layer and a single hidden layer with 15 neurons. The hyperbolic tangent sigmoid activation function was used in the hidden layer, and the training algorithm was Bayesian regularization backpropagation. The efficiency of the prediction was quantified by R².
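A network with this topology can be sketched with scikit-learn's MLPRegressor on simulated data. Note one substitution: Bayesian regularization backpropagation (MATLAB's trainbr, used in the study) is not available in scikit-learn, so L-BFGS with an L2 weight penalty is used here as a rough stand-in; all data and values below are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
# Simulated inputs already scaled to [-1, 1], plus a nonlinear response
X = rng.uniform(-1, 1, size=(90, 10))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 3] ** 2 + rng.normal(scale=0.1, size=90)

net = MLPRegressor(
    hidden_layer_sizes=(15,),   # a single hidden layer with 15 neurons
    activation="tanh",          # hyperbolic tangent sigmoid activation
    solver="lbfgs",             # stand-in for Bayesian-regularization backprop
    alpha=1e-3,                 # L2 penalty as a rough proxy for weight regularization
    max_iter=5000,              # cap analogous to the 5,000 training epochs
    random_state=2,
).fit(X, y)

r2 = net.score(X, y)            # prediction efficiency quantified by R^2
```

The weight matrix between input and hidden layers has shape (10 inputs × 15 neurons), matching the single-hidden-layer topology described above.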

Importance of variables
To quantify the importance of variables through the PMC network, two techniques were used. The first is based on Garson's (1991) algorithm as modified by Goh (1995) (AG), which consists of partitioning the neural network connection weights to determine the relative importance of each input variable within the network. This algorithm describes the relative magnitude of the importance of the descriptors (predictors) in their connection with the outcome variables through the dissection of the synaptic weights of the neural network. In the second technique, the importance of the variables (inputs) is assessed through the impact of disrupting or disturbing the information of a given input on the estimate of the coefficient of determination. This importance is estimated by exchanging the information, or by making the phenotypic values of each variable constant, and verifying the changes in the estimates of R². When we disturb the values of a variable and R² decreases, there is an indication that this input variable is important relative to the others for prediction with the already established network.
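The second (disturbance) technique can be sketched as follows: each input is made constant (zero) in turn and the drop in R²* relative to the intact network is recorded. The data are simulated so that only the first input truly matters; this is an illustrative sketch, not the study's pipeline:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(120, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=120)   # only input 0 truly matters

net = MLPRegressor(hidden_layer_sizes=(15,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=3).fit(X, y)
r2 = net.score(X, y)

# Disturbance importance: zero out each input in turn and record the
# reduction in R^2* relative to the already established network
drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = 0.0
    drops.append(r2 - net.score(Xp, y))
```

The largest drop occurs for the input that carries the signal, which is exactly the indication of importance described above.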

Radial Basis Function network (RBF)
The radial basis function network is characterized by having only one hidden layer and by using the Gaussian activation function (Cruz & Nascimento, 2018). The structure of the RBF used to predict grain yield, panicle length, and grain length-to-width ratio was established with 10 to 30 neurons (increased by 2 with each processing) and a radius between 5 and 15, increased by 0.5. The efficiency of the prediction was measured by R², and the relative importance of each input was measured by the technique of disturbing the information of each explanatory variable, as already described for the PMC.
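A minimal RBF network of this kind can be sketched as Gaussian hidden units (centers chosen by k-means) followed by a linear output layer fitted by least squares. The grid search below mirrors the spirit of the neuron grid above, but the radius range is narrowed because the simulated inputs are scaled to [-1, 1] (an assumption of this sketch, not the study's setting):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_design(X, centers, radius):
    """Gaussian activations: one column per hidden neuron (center)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * radius ** 2))

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(150, 3))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=150)

best = None
# Neurons from 10 to 30 in steps of 2, as in the study; the radius grid is
# adapted to the [-1, 1] scale of this simulated data
for m in range(10, 31, 2):
    centers = KMeans(n_clusters=m, n_init=10, random_state=4).fit(X).cluster_centers_
    for radius in np.arange(0.5, 3.0, 0.5):
        H = np.column_stack([rbf_design(X, centers, radius), np.ones(len(X))])
        w, *_ = np.linalg.lstsq(H, y, rcond=None)
        y_hat = H @ w
        r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        if best is None or r2 > best[0]:
            best = (r2, m, radius)
```

The best (R², neurons, radius) triple plays the role of the structure "established" by the grid search described above.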

Machine learning
To predict grain yield, panicle length, and grain length-to-width ratio and to quantify the importance of variables through a machine learning approach, a decision tree and its refinements, random forest, bagging, and boosting, were used. R² measured the quality of the predictive model fit, and the mean squared error (MSE) was used to quantify the importance of variables in flood-irrigated rice. The mean squared error was estimated as described in Equation 3:

MSE = (1/n) Σᵢ (Yᵢ − Ŷᵢ)² (3)

where Yᵢ and Ŷᵢ correspond to the observed and predicted values of genotype i, respectively, and n is the total number of observations (variable, depending on the environment analysed). In these techniques, the importance of an explanatory variable is quantified as the mean decrease in prediction precision, estimated as the percentage increase in mean squared error (IMSE) obtained when the values of each variable of the data set are exchanged and the prediction is compared with that of the original, unchanged data set. Analogous to regression analysis, it is the average increase in the squared residuals of the data set when the variable is exchanged (Li & Zhan, 2019). Higher values of IMSE represent higher variable importance. For better efficiency of the estimates of the importance of variables, 5,000 trees were generated.
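The IMSE idea, exchanging (permuting) each variable's values and measuring the resulting increase in MSE, can be sketched with a random forest and scikit-learn's permutation importance. The data are simulated, and fewer trees are grown here than the study's 5,000, purely for speed:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 5))
# Simulated response: inputs 1 and 4 are relevant, the others are noise
y = 2.0 * X[:, 1] + X[:, 4] ** 2 + rng.normal(scale=0.2, size=120)

# The study used 5,000 trees; 500 are grown here for a quick illustration
forest = RandomForestRegressor(n_estimators=500, random_state=5).fit(X, y)

# IMSE-style importance: shuffle ("exchange") each variable's values and
# measure the increase in squared error relative to the intact data set
imp = permutation_importance(forest, X, y, n_repeats=10, random_state=5,
                             scoring="neg_mean_squared_error")
imse = imp.importances_mean        # larger value -> more important variable
```

The relevant inputs receive markedly larger IMSE values than the noise inputs, matching the interpretation given above.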
The analyses were performed with the aid of R software using the NeuralNetTools (Beck, 2018) and Genes (Cruz, 2016) packages, which use an interface with MATLAB software (Matlab, 2016).

Prediction by different approaches
The estimates of R² for all methodologies using the explanatory variables to predict grain yield (GY), panicle length (PL), and grain length-to-width ratio (LGW) in flood-irrigated rice are shown in Figure 1. Based on Figure 1, it is possible to compare the methodologies and define those that proved most efficient for the prediction of GY, PL, and LGW. Higher values of this estimate indicate that the target prediction variable is better fitted by the explanatory variables (Roy & Roy, 2008; Hassanzadeh, Ghavami, & Kompany-Zareh, 2015). Among the methodologies used in this study, multiple regression showed the lowest estimates of R² (Figure 1) for the same variable, indicating the existence of nonlinear associations between the explanatory variables that are not considered in the model. Artificial intelligence and machine learning methodologies, in turn, stood out for their ability to extract nonlinear information from the model inputs (Parmley et al., 2019; Skawsang et al., 2019), as seen in Figure 1. Other authors have already highlighted the ability of neural networks to better capture nonlinear relationships compared to conventional methodologies (Silva et al., 2014; Sant'anna et al., 2016). The results obtained by the different approaches show a discrepancy between the maximum estimates of R² for the predictive variables in the same environments (Figure 1). The artificial intelligence approach in the Leopoldina environment provided higher estimates for the predictive variables PL and GY in the RBF procedure, 83.44 and 78.90%, respectively. The response variable GY had the best estimate of R² in the Lambari and Janaúba environments in the PMC network with only one neuron in the output layer and a single hidden layer (Figure 1). In the Leopoldina and Lambari environments, for the response variable LGW, the maximum estimate of R² was approximately 100% by the multiple regression and artificial intelligence approaches.
In Janaúba, on the other hand, the maximum estimate was approximately 62%. The differences in the results obtained in these analyses indicate that the environment influences the estimate of R² and, consequently, the cause-and-effect relationships between the response variable and the set of explanatory variables.
Machine learning approaches proved to be more efficient than the other approaches (Figure 1). There was a low estimate of R² for the predictive variable GY in the Janaúba environment in the random forest procedure, corresponding to 18.57%. This result is inferior to all the other approaches used in this study. In this same environment, for the bagging procedure, the estimate of R² was 94.76%. High estimates of R² (above 80%) were obtained using the machine learning procedures bagging and boosting for all predictive variables (Figure 1). The decision tree (AD) and random forest methodologies did not stand out from the other machine learning procedures (Figure 1). Sousa et al. (2020) emphasized that the low predictive accuracy of the AD can be improved using ensemble methods such as bagging, random forest, and boosting. These strategies combine multiple decision trees to reduce variability.
Random forests and bagging have good predictive performance in practice; they work well for high-dimensional problems and can be used with multiclass outputs, categorical predictors, and imbalanced problems (Gregorutti, Michel, & Saint-Pierre, 2017). These authors reported satisfactory variable selection with the random forest algorithm in the presence of correlated predictors.
When the variables are correlated, the simple correlation coefficient provides incomplete information, because a high correlation between two variables may result from the action of a third variable or a group of variables. Traditional methods, such as path analysis, decompose correlations into direct and indirect effects on the main variable, and logistic regression becomes unstable in the presence of high correlations. Multicollinearity is caused by high correlation between the variables, which creates a lack-of-fit problem that affects the estimates of the model parameters. The ability of ANNs to circumvent the problem of multicollinearity has already been highlighted in the literature (Cruz & Nascimento, 2018). These authors presented an application in which a response variable is predicted through five explanatory variables; including a sixth explanatory variable that assumed the same values as the fifth did not affect the accuracy of the Adaline ANN in any way. However, they reinforce that in the classical multiple linear regression approach there would be no solution, since there would be two linearly dependent columns in the prediction matrix X, so that the established multicollinearity would lead to an X'X matrix without a unique inverse.
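The X'X singularity described in the cited example is easy to demonstrate numerically. The sketch below (simulated data) duplicates the fifth column as a sixth variable and shows that the Gram matrix loses rank, while a minimum-norm least-squares solution still exists, loosely analogous to the ANN's insensitivity to the redundancy:

```python
import numpy as np

rng = np.random.default_rng(6)
X5 = rng.normal(size=(50, 5))
y = X5 @ np.array([1.0, 0.5, -0.3, 0.2, 0.8]) + rng.normal(scale=0.1, size=50)

# Sixth explanatory variable duplicating the fifth, as in the cited example
X6 = np.column_stack([X5, X5[:, 4]])

# With a duplicated column, X'X is singular: the classical normal
# equations of multiple regression have no unique solution
gram = X6.T @ X6
rank = np.linalg.matrix_rank(gram)     # 5, not 6
cond = np.linalg.cond(gram)            # effectively infinite

# A minimum-norm least-squares solution still exists via the pseudoinverse
w, *_ = np.linalg.lstsq(X6, y, rcond=None)
```

The rank deficiency (5 instead of 6) is precisely the multicollinearity problem that prevents inversion of X'X in the classical approach.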
The efficiency of ANNs in prediction problems, given their ability to extract relevant information from large data sets and to generalize from relatively inaccurate information (Porwal, Carranza, & Hale, 2003), was very well expressed by the results obtained (Figure 1). The same can be said of methodologies based on machine learning, which are capable of handling reduced or redundant information in the input variables (Quinlan, 1996). However, a task as important as prediction, and one that is often not carried out, is the identification of the explanatory variables of greatest importance, which constitutes valuable information for understanding the adjusted model and for decisions about dimensionality reduction in future studies (Beucher et al., 2019). Thus, after the prediction analysis, the importance of the variables was quantified using artificial intelligence and machine learning methods to identify, among the set of explanatory variables, those that should be prioritized and used as auxiliary characteristics in indirect selection.

Importance of variables in prediction by the artificial intelligence approach
For ease of interpretation, we denote by R² the prediction quality of the methodology and by R²* this same quality of fit after the disturbance of an explanatory variable.

Multilayer Perceptron (PMC)
Neural networks tend to perform well compared to other predictive algorithms based on machine learning (Santos, Dean, Weaver, & Hovanski, 2018). These algorithms are capable of learning linear and nonlinear relationships in the data (Somers & Casal, 2009; Haddouche, Chetate, & Said Boumedine, 2018). They can also measure and incorporate direct effects and interaction effects between variables in predictive models (Tsang, Cheng, & Liu, 2017).
The PMC network is widely used in predictive processes (Gedeon, Wong, & Harris, 1995; Santos et al., 2018), and several research groups have shown mathematically that, with only a single hidden layer, this network works very well with different numbers of neurons in the hidden layer (De Oña & Garrido, 2014; Santos et al., 2018).
The importance of the variables was quantified by assigning a zero value to the phenotypic information of each variable and observing the changes in the values of R²*. The results of the PMC network are shown in Table 1. It is important to note that, in this table, reductions in the values of R²* after assigning a zero value to the phenotypic information of a variable indicate that this variable is important relative to the others for prediction with the already established network. LO: lodging, HP: height (cm), GL: grain length (mm), PL: panicle length (cm), GT: grain thickness (mm), FL: flowering (days), GW: grain width (mm), GP: number of filled grains per panicle, WG: weight of 100 grains (g), TI: tillering, FG: percentage of filled grains, LGW: grain length-to-width ratio, GY: grain yield. Environments: E1: Leopoldina, E2: Lambari, E3: Janaúba.
The results in Table 1 show great discrepancies in R²* when comparing the environments with each other, which makes interpretation difficult. For the response variable LGW, the strategy of assigning a zero value to the phenotypic information was efficient in identifying grain length and width as important, given the reduction in the estimate of R²*. It should be remembered that such changes must be seen relative to the prediction R², which was approximately 100% in the Leopoldina and Lambari environments and 63% in Janaúba (Figure 1). For Leopoldina, when zeroing the variables HP, GL, and TI, for example, the R²* values were 0.04, 0.52, and 1.70, respectively (Figure 1). This result shows that these variables are important in predicting GY, because the disturbance of their values led to a considerable reduction in the quality of the fit. In Lambari, the variable that presented the highest contribution was FL. Regardless of the predictive variable, the PMC networks, with only one neuron in the output layer and a single hidden layer, agreed in pointing to grain width and length as the most important variables, given the significant drops in the estimates of R²* observed when these variables were zeroed.
To overcome the difficulties faced when adopting PMC networks to study the importance of variables, an alternative is to use the AG algorithm, which partitions the ANN connection weights to determine the relative importance of each input variable within the network. The weights that connect neurons in an ANN are partially analogous to the coefficients in a generalized linear model (Beck, 2018), so that the combined effects of the weights on the model's predictions represent the relative importance of the predictors in their associations with the response variable. The large number of adjustable weights in an artificial neural network makes it very flexible in modelling nonlinear effects but imposes challenges for its interpretation. In this algorithm, the number of neurons that produced the maximum estimate of R²* was used to obtain a better estimate of the relative contribution of the variables.
The percentages of the relative contribution estimated by the AG method are described in Table 2. In this table, for the response variable GY, the results were consistent in pointing to plant height (HP), flowering (FL), and the number of filled grains per panicle (GP) in terms of relative contribution. For the response variable PL, the variable with the greatest relative contribution was grain yield (GY) in the Leopoldina and Lambari environments; in Janaúba, however, the variables that stood out were grain length and width. Regarding the response variable LGW, the percentages of relative contribution revealed that grain length and grain width had the largest contributions. This result was expected, since grain length and width are determinants of LGW. The results indicate that the AG approach is efficient in quantifying the importance of variables in studies involving PMC neural networks.

Radial Basis Function network (RBF)
The importance of the flood-irrigated rice characters was quantified by assigning a zero value to the information of an input variable after the RBF had been established, as described in Table 3. In this table, the values are those obtained after disturbing the input variables by assigning a zero value to each explanatory variable. When using this strategy, drastic reductions in the values of R²* were observed for the most important variables, grain length (GL) and grain width (GW), when the target prediction variable was LGW. For the other response variables, the results were more discrepant in quantifying the true importance of the variables. When the response variable was GY, the variables that suffered the greatest reduction in R²* in Janaúba were flowering (FL, R²* = 23.80) and weight of 100 grains (WG, R²* = 19.91); in Leopoldina, they were plant height (HP, R²* = 21.26), grain width (GW, R²* = 24.83), and weight of 100 grains (WG, R²* = 24.25); and in Lambari, the most important variable using this approach was flowering (FL, R²* = 28.43).
For the response variable PL, we observed changes in the values of R²* in Leopoldina and Lambari for the variable flowering (FL, R²* = 47.77 and R²* = 46.76, respectively). In Leopoldina, the percentage of filled grains (FG, R²* = 25.51) also showed a drastic reduction in R²*. In Lambari, lower estimates of R²* were obtained for the variable weight of 100 grains (WG, R²* = 45.60). For Janaúba, the results show that the most important variables using the RBF were grain width (GW, R²* = 19.76) and weight of 100 grains (WG, R²* = 23.11).

Importance of variables in prediction by the machine learning approach
Table 4 shows the averages of the relative contributions of the explanatory variables for predicting grain yield, panicle length, and grain length-to-width ratio, estimated as the percentage increase in mean squared error (IMSE), which is obtained by exchanging the values of each variable in the data set and comparing the prediction with that of the original, unexchanged data set. In this case, unlike the strategy used for the computational intelligence methodologies (the PMC and RBF networks), for which lower values of R²* indicated greater importance of a variable for the model, in the machine learning approach the importance of an explanatory variable is related to the estimate of the average decrease in the accuracy of the model through the IMSE, so that the higher this estimate, the greater the importance of the variable.
Table 4. Average estimates of the relative contributions of the explanatory variables for predicting grain yield, panicle length, and grain length-to-width ratio in flood-irrigated rice using a machine learning approach in three environments in Minas Gerais.
PV: predictive variable, LO: lodging, HP: height (cm), GL: grain length (mm), PL: panicle length (cm), GT: grain thickness (mm), FL: flowering (days), GW: grain width (mm), GP: number of filled grains per panicle, WG: weight of 100 grains (g), TI: tillering, FG: percentage of filled grains, LGW: grain length-to-width ratio, GY: grain yield. Environments: E1: Leopoldina, E2: Lambari, E3: Janaúba. FA: random forest, BA: bagging, BO: boosting.
Based on Table 4, the variables that obtained the highest estimates in all machine learning methodologies were grain length (GL) and grain width (GW) when the target prediction variable was grain length-to-width ratio (LGW), in all environments. For this same response variable, another variable with a high IMSE estimate was panicle length (PL) in Leopoldina and Lambari; in Janaúba, this variable was not considered among the most important, given its low IMSE percentage. On the other hand, the variables weight of 100 grains (WG) and number of filled grains per panicle (GP) proved efficient in the prediction of LGW by boosting. This procedure proved to be more consistent in predicting the variables than the others.
The variable that obtained the highest IMSE estimate when PL was the target prediction variable was plant height (HP) in Leopoldina and Lambari. In Janaúba, on the other hand, this variable did not stand out in predicting PL. In Leopoldina, another variable that stood out in predicting PL was the number of filled grains per panicle (GP), in all machine learning approaches. For the response variable PL, the variable GY presented the highest IMSE in Janaúba in the bagging procedure. Regarding the boosting procedure, for the same predictive variable, the results show discrepancies; on the other hand, this procedure was more consistent in predicting the variable. In this procedure, using PL as the predictive target, the variables GP, GY, and LGW stood out in Leopoldina. In Lambari, the variables that performed best in predicting PL were GW, GY, and LGW, and in Janaúba, they were PL, GW, GP, GY, and LGW.
When the target prediction variable was GY, in Leopoldina, the variables that obtained high IMSE percentage estimates were plant height (HP) and grain length (GL), in all machine learning procedures. In Lambari, on the other hand, the variable that stood out was panicle length (PL). In this environment, another variable that showed better predictive performance when GY was the main variable was flowering (FL), in bagging and random forest. In the boosting procedure, the variables that stood out were HP, GL, PL, GP, WG, and LGW, in all environments.
The literature has highlighted machine learning techniques as efficient tools for quantifying the relative importance of variables, in view of their simplicity, their freedom from assumptions about the distribution of the explanatory variables, and their robustness to data quantity, redundancy, and environmental influences (Tan et al., 2014; Beucher et al., 2019). The same could not be verified for the regression method.
Grain yield is a trait controlled by several genes and therefore exhibits quantitative inheritance (Freitas et al., 2007). Grain yield thus depends on the interaction of several yield components, for example, the number of spikelets and grains per panicle, the thousand-grain mass, the spikelet fertility index, and panicle length, which are controlled by genetic and environmental factors. Panicle length, the number of spikelets per panicle, spikelet fertility, and thousand-grain mass directly affect grain yield (Evans & Bhatt, 1977). Thus, knowledge of these relationships can help breeders select new cultivars, which can increase the productivity and quality of grains and decrease production costs and environmental impact.
The longer the flowering period in rice, the more photoassimilates are produced and translocated to the grains, with a consequent increase in grain yield. Accordingly, late-cycle cultivars tend to be more productive than early-cycle cultivars, since they accumulate a greater amount of photoassimilates to be translocated to the grains. According to Ntanos and Koutroubas (2002), productivity in rice has been explained by differences in the dynamics of the distribution of assimilates between organs during plant growth and development. Those studies found that dry matter production and photoassimilate translocation contributed significantly to grain development in different cultivars and, consequently, had a direct relationship with grain yield.
Grain dimensions are the main determinants of grain weight and one of the three components (number of panicles per plant, number of grains per panicle, and weight of grains) of grain yield; therefore, they are important characteristics that affect yield in rice.In plant breeding applications, grain size is generally assessed by the weight of the grain, which is positively correlated with various characteristics, including the length, width, and thickness of the grain (Fan et al., 2006).These characteristics also influence acceptability for consumers, and therefore, the size/shape of the rice grain is an important preferential target characteristic for breeders (Huang et al., 2012;Anacleto et al., 2015).Cultivars of the short and long types are highly preferred by many consumers in Japan, South Korea, and North China, while consumers in India, the USA, and other countries in South and Southeast Asia prefer long and medium grains (Misra et al., 2017).
Methodologies based on machine learning and computational intelligence do not depend on stochastic information and tend to be more efficient. These methodologies make no assumptions about the model but capture complex factors such as epistasis and dominance in prediction models; it is not necessary to know whether the data contain these effects, and no assumptions about the distribution of phenotypic values are required (Sousa et al., 2020). Machine learning algorithms have the advantage of modelling data in a nonlinear and nonparametric manner (Osco et al., 2020). Unlike many traditional statistical methods, these algorithms can handle noisy, complex, and heterogeneous data (Osco et al., 2019). In this study, we compare different approaches to quantifying variable importance in order to identify relevant predictive variables within a regression problem. Additionally, we included in our comparison a traditional method that aims to find a small subset of important variables with optimal forecasting performance in flood-irrigated rice.
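As a concrete illustration of one of the machine-learning importance measures compared here, the sketch below ranks predictors with a random forest and permutation importance. The data and trait names (flowering, grains per panicle, panicle length) are synthetic placeholders, not the study's real data set, and the random forest stands in for only one of the several methods the study evaluates.

```python
# Illustrative sketch (not the authors' exact pipeline): ranking predictor
# importance with a random forest and permutation importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 200
# Hypothetical traits: flowering strongly drives yield, the others weakly.
flowering = rng.normal(size=n)
grains_per_panicle = rng.normal(size=n)
panicle_length = rng.normal(size=n)
yield_ = (3.0 * flowering + 0.5 * grains_per_panicle
          + 0.1 * panicle_length + rng.normal(scale=0.1, size=n))

X = np.column_stack([flowering, grains_per_panicle, panicle_length])
names = ["flowering", "grains_per_panicle", "panicle_length"]

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, yield_)
imp = permutation_importance(model, X, yield_, n_repeats=10, random_state=0)
# Sort traits from most to least important.
ranking = sorted(zip(names, imp.importances_mean), key=lambda t: -t[1])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Permutation importance, like the perturbation approach used in the study, measures how much predictive accuracy degrades when a variable's information is destroyed, so strongly and weakly contributing variables can be distinguished without distributional assumptions.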
It is noteworthy that the 13 characteristics used in this study are laborious to obtain, and their evaluation can be costly when a large number of genotypes must be evaluated. In this context, studying which characteristics contribute most to prediction is necessary, since it makes it possible to reduce the physical effort, cost, labour, and time spent in experimentation (Ferreira et al., 2015).
Predicting the importance of flood-irrigated rice characteristics is of paramount importance for breeding programmes, as it directs genotype selection more practically, in addition to serving as a theoretical and practical framework to support the recommendation of new cultivars.
Therefore, our study evaluates the performance of several methodologies for assessing the relative contribution of each variable, using computational intelligence and machine learning, in flood-irrigated rice. This approach to quantifying the effect of explanatory variables successfully identified the true importance of each variable, including those with strong and weak correlations with the main variables, which in our case are grain yield, panicle length, and the grain length-to-width ratio.
Researchers can now identify the individual and interactive contributions of the predictor variables to the rice crop using artificial intelligence and machine learning.

Conclusion
Computational intelligence and machine learning methodologies were able to quantify the importance of the explanatory variables in predicting grain yield, grain length-to-width ratio, and panicle length in rice. Artificial intelligence and machine learning were also able to handle reduced or redundant information in the input variables. The characteristics indicated to assist in decision making are flowering, number of grains filled per panicle, and panicle length. The network with only one hidden layer of 15 neurons was efficient in determining the relative importance of variables in flood-irrigated rice.

Figure 1. Maximum estimate of the coefficient of determination in three environments for predicting grain yield (GY), panicle length (PL), and grain length-to-width ratio (LGW) in flood-irrigated rice. A: panicle length; B: grain yield; C: grain length-to-width ratio. RG: multiple regression; PMC: multilayer perceptron; RBR: radial basis network; AD: decision tree; FA: random forest; BA: bagging; BO: boosting.

Table 1. Estimates of the coefficient of determination, obtained with the PMC, for predicting grain yield, panicle length, and grain length-to-width ratio after perturbation (zero-value assignment) of the explanatory variable values.
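The perturbation scheme behind Table 1 (assign zero to one input variable at a time and record the resulting loss of predictive power) can be sketched as below. The data, layer size, and network library are illustrative assumptions; only the single hidden layer of 15 neurons mirrors the network the study retained.

```python
# Sketch of the zero-perturbation importance idea described for Table 1:
# zero out one input variable at a time and record the drop in R².
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 3))
# Synthetic response: variable 0 matters most, variable 2 not at all.
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=n)

# One hidden layer with 15 neurons, as in the network the study retained.
net = MLPRegressor(hidden_layer_sizes=(15,), max_iter=5000,
                   random_state=0).fit(X, y)
baseline = r2_score(y, net.predict(X))

drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = 0.0                      # perturbation: zero variable j
    drops.append(baseline - r2_score(y, net.predict(Xp)))

print([round(d, 3) for d in drops])
```

A large drop in R² after zeroing a variable indicates that the network relied heavily on it, which is how the relative importance of the explanatory variables is read off the table.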

Table 2. Estimates of the relative contribution, computed by the method of Garson (1991) modified by Goh (1995), of 12 variables for predicting grain yield, panicle length, and grain length-to-width ratio in flood-irrigated rice in three environments in the State of Minas Gerais.
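The Garson (1991)/Goh (1995) measure used for Table 2 derives relative contributions directly from the trained network's weights. A minimal sketch for a single-hidden-layer network follows; the weight values are toy numbers, and details of the original modification (e.g. handling of biases) are simplified here.

```python
# Minimal sketch of Garson's (1991) weight-based relative contribution,
# as modified by Goh (1995), for a single-hidden-layer network.
import numpy as np

def garson(w_ih, w_ho):
    """w_ih: (n_inputs, n_hidden) input-hidden weights;
    w_ho: (n_hidden,) hidden-output weights."""
    # Share each hidden neuron's output weight among the inputs in
    # proportion to the absolute input-hidden weights feeding it.
    contrib = np.abs(w_ih) * np.abs(w_ho)             # (n_inputs, n_hidden)
    contrib /= np.abs(w_ih).sum(axis=0, keepdims=True)
    importance = contrib.sum(axis=1)
    return importance / importance.sum()              # sums to 1

# Toy weights: input 0 feeds the hidden layer far more strongly.
w_ih = np.array([[4.0, 3.0],
                 [1.0, 0.5],
                 [0.2, 0.1]])
w_ho = np.array([2.0, 1.0])
ri = garson(w_ih, w_ho)
print(np.round(ri, 3))
```

Because the contributions are normalized to sum to one, each entry can be reported as the percentage of the network's predictive signal attributable to that variable, which is the form in which Table 2 presents the 12 predictors.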

Table 3. Estimates of the coefficient of determination for predicting grain yield, panicle length, and grain length-to-width ratio using the RBF network, assigning zero value to the genotype information.