Multivariate adaptive regression splines (MARS) applied to daily reference evapotranspiration modeling with limited weather data

Estimation of reference evapotranspiration (ETo) is very relevant for water resource management. The Penman-Monteith (PM) equation was proposed by the Food and Agriculture Organization (FAO) as the standard method for estimation of ETo. However, this method requires various weather data, such as air temperature, wind speed, solar radiation and relative humidity, which are often unavailable. Thus, the objective of this study was to compare the performance of multivariate adaptive regression splines (MARS) and alternative equations, in their original and calibrated forms, to estimate daily ETo with limited weather data. Daily data from 2002 to 2016 from 8 Brazilian weather stations were used. ETo was estimated using empirical equations, PM equation with missing data and MARS. Four data availability scenarios were evaluated as follows: temperature only, temperature and solar radiation, temperature and relative humidity, and temperature and wind speed. The MARS models demonstrated superior performance in all scenarios. The models that used solar radiation showed the best performance, followed by those that used relative humidity and, finally, wind speed. The models based only on air temperature had the worst performance.


Introduction
Evapotranspiration is one of the main components of the water cycle, allowing the transfer of water and energy into the atmosphere (Fernandes, Paiva, & Rotunno Filho, 2012). Its estimation is very relevant for decision making regarding water use, irrigation scheduling, environmental studies, and others (Pereira, Allen, Smith, & Raes, 2015).
The Food and Agriculture Organization (FAO) proposed the Penman-Monteith (PM) equation as a standard method for estimation of reference evapotranspiration (ETo) (Allen, Pereira, Raes, & Smith, 1998). It is an equation that, due to its physical basis, requires several climatic parameters, such as air temperature, relative humidity, wind speed and solar radiation. The large number of required climatic variables is one of the factors responsible for the satisfactory performance of this method; however, its use is limited, since these data are commonly unavailable or unreliable in several regions of the world (Talaee, 2014), especially in developing countries (Djaman, Irmak, & Futakuchi, 2017), such as Brazil.
To estimate ETo with limited weather data, many studies have been conducted by using a reduced number of variables and developing empirical and semi-empirical models based on temperature (Hargreaves & Samani, 1985; Oudin et al., 2005), temperature and solar radiation (Makkink, 1957; Jensen & Haise, 1963), temperature and relative humidity (Valiantzas, 2013), and others. These methods, unlike the PM equation, which can be used globally without additional adjustments (Pereira et al., 2015), require local calibrations to obtain more satisfactory performances (Gao, Peng, Xu, Yang, & Wang, 2015).
Among data-driven methods, multivariate adaptive regression splines (MARS) is a promising technique for estimation of ETo; however, it has been little explored for this purpose. MARS is a type of regression proposed by Friedman (1991) that is capable of modeling complex relations between a response variable and a set of predictor variables. According to Koc and Bozdogan (2015), this technique has been applied successfully in several areas of knowledge, such as medicine, business, and molecular biology. In addition to its modeling potential, MARS has the advantage that the fitted model can be written as an explicit algebraic equation, unlike other data-driven methods, such as artificial neural networks (ANN), support vector machines (SVM), extreme learning machines (ELM), and adaptive neuro-fuzzy inference systems (ANFIS).
In this context, this study aims to compare the performance of MARS and alternative equations (in their original and calibrated forms) to estimate daily ETo with limited weather data.

Database and study sites
To carry out this study, daily data from 2002 to 2016 were collected from 8 weather stations of the National Institute of Meteorology (INMET) of Brazil, available in the Meteorological Database for Teaching and Research (BDMEP) (Table 1). The stations were selected in several regions of Brazil to cover various climatic conditions and, consequently, to make the results more representative; their main characteristics are presented in Table 2. Maximum and minimum air temperature (°C), relative humidity (%), sunshine duration (h) and wind speed at 10 m height (m s⁻¹) were obtained. The wind speed was converted to 2 m height, and the solar radiation was estimated based on sunshine duration, as recommended by Allen et al. (1998).
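The two conversions mentioned above can be sketched as follows. This is a minimal illustration that assumes the logarithmic wind profile and the Angström formula given by Allen et al. (1998); the function names and the default coefficients a_s = 0.25 and b_s = 0.50 are assumptions, not values confirmed by this study.

```python
import math


def wind_speed_at_2m(u_z, z=10.0):
    """Convert wind speed measured at height z (m) to 2 m height.

    Logarithmic wind profile of Allen et al. (1998):
    u2 = uz * 4.87 / ln(67.8 * z - 5.42)
    """
    return u_z * 4.87 / math.log(67.8 * z - 5.42)


def solar_radiation_from_sunshine(n_hours, day_length, ra, a_s=0.25, b_s=0.50):
    """Estimate solar radiation Rs (MJ m-2 day-1) from sunshine duration.

    Angstrom formula: Rs = (a_s + b_s * n / N) * Ra, where n is the
    measured sunshine duration (h), N the photoperiod (h) and Ra the
    extraterrestrial radiation. a_s and b_s are assumed default values.
    """
    return (a_s + b_s * n_hours / day_length) * ra


if __name__ == "__main__":
    # Hypothetical day: 3 m/s at 10 m, 6 h of sunshine out of 12 h, Ra = 30
    print(wind_speed_at_2m(3.0))
    print(solar_radiation_from_sunshine(6.0, 12.0, 30.0))
```

For an anemometer at 10 m, the conversion factor is roughly 0.75, so wind speeds at 2 m are about a quarter lower than the measured values.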
The collected data were preprocessed to eliminate days with missing data or with physically inconsistent records: a minimum temperature higher than the maximum temperature, a sunshine duration that was negative or longer than the photoperiod, a relative humidity that was negative or greater than 100%, or a wind speed (at 10 m height) that was negative or greater than 20 m s⁻¹.
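The screening rules above translate into a simple record filter. The sketch below is only illustrative: the function name, the argument order and the use of None for missing values are assumptions, since the study does not describe its implementation.

```python
def is_valid_day(t_max, t_min, sunshine, day_length, rh, wind_10m):
    """Return True if a daily record passes the consistency checks.

    Arguments: temperatures in degrees C, sunshine duration and
    photoperiod in hours, relative humidity in %, wind speed at 10 m
    in m/s. None marks a missing value (an assumed convention).
    """
    values = (t_max, t_min, sunshine, day_length, rh, wind_10m)
    if any(v is None for v in values):
        return False  # missing data
    if t_min > t_max:
        return False  # minimum temperature above maximum temperature
    if sunshine < 0 or sunshine > day_length:
        return False  # sunshine outside the range [0, photoperiod]
    if rh < 0 or rh > 100:
        return False  # relative humidity outside the range [0, 100] %
    if wind_10m < 0 or wind_10m > 20:
        return False  # wind speed outside the range [0, 20] m/s
    return True


if __name__ == "__main__":
    # Hypothetical records: the second has more sunshine than daylight
    records = [
        (30.1, 21.4, 8.2, 12.1, 65.0, 2.3),
        (29.0, 20.0, 13.5, 12.1, 70.0, 1.8),
    ]
    print([is_valid_day(*r) for r in records])
```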

Methods for estimation of ET o
The Penman-Monteith (PM) equation (Equation 1), using all required measured meteorological data, was used as a reference to estimate daily ETo.
$$ET_{oPM} = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\dfrac{900}{T + 273}\,U_2\,(e_s - e_a)}{\Delta + \gamma\,(1 + 0.34\,U_2)} \tag{1}$$
where: $ET_{oPM}$ represents the reference evapotranspiration estimated by the Penman-Monteith equation (mm day⁻¹), $R_n$ represents the net solar radiation (MJ m⁻² day⁻¹), $G$ represents the soil heat flux (MJ m⁻² day⁻¹; considered to be null for daily estimates), $T$ represents daily mean air temperature (°C), $U_2$ represents the wind speed at a 2 m height (m s⁻¹), $e_s$ represents the saturation vapor pressure (kPa), $e_a$ represents the actual vapor pressure (kPa), $\Delta$ represents the slope of the saturation vapor pressure function (kPa °C⁻¹), and $\gamma$ represents the psychrometric constant (kPa °C⁻¹).
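As an illustration of how the reference equation combines its inputs, the sketch below evaluates the FAO Penman-Monteith expression for one day. It assumes the auxiliary relations of Allen et al. (1998) for the slope of the vapor pressure curve and for the psychrometric constant; the function names and the atmospheric pressure argument are assumptions introduced for the example.

```python
import math


def saturation_vapor_pressure(t):
    """Saturation vapor pressure (kPa) at air temperature t (degrees C)."""
    return 0.6108 * math.exp(17.27 * t / (t + 237.3))


def penman_monteith_eto(t_mean, rn, u2, es, ea, pressure=101.3, g=0.0):
    """Daily reference evapotranspiration (mm/day), FAO Penman-Monteith form.

    t_mean: daily mean air temperature (degrees C)
    rn, g:  net radiation and soil heat flux (MJ m-2 day-1)
    u2:     wind speed at 2 m (m/s)
    es, ea: saturation and actual vapor pressure (kPa)
    pressure: atmospheric pressure (kPa), an assumed input used for gamma
    """
    # Slope of the saturation vapor pressure curve (kPa per degree C)
    delta = 4098.0 * saturation_vapor_pressure(t_mean) / (t_mean + 237.3) ** 2
    # Psychrometric constant (kPa per degree C)
    gamma = 0.000665 * pressure
    numerator = (0.408 * delta * (rn - g)
                 + gamma * 900.0 / (t_mean + 273.0) * u2 * (es - ea))
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical temperate summer day
    print(penman_monteith_eto(t_mean=16.9, rn=13.28, u2=2.078,
                              es=1.997, ea=1.409, pressure=100.1))
```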
To evaluate the performance of MARS against conventional equations, ETo was estimated in four measured data availability scenarios: temperature data only, temperature and solar radiation, temperature and relative humidity, and temperature and wind speed. To accomplish this, in addition to MARS, empirical equations and the Penman-Monteith equation with missing data were used.
To estimate ETo using the PM equation with missing data, actual vapor pressure and solar radiation were estimated with Equations 9 and 10, respectively, and wind speed was set at 2 m s⁻¹, as recommended by Allen et al. (1998).
$$e_a = 0.611\,\exp\left(\frac{17.27\,T_{min}}{T_{min} + 237.3}\right) \tag{9}$$
where: $e_a$ represents the actual vapor pressure (kPa), and $T_{min}$ represents the minimum air temperature (°C).
$$R_s = k_{Rs}\,\sqrt{T_{max} - T_{min}}\;R_a \tag{10}$$
where: $R_s$ represents solar radiation (MJ m⁻² day⁻¹), $R_a$ represents extraterrestrial radiation (MJ m⁻² day⁻¹), $T_{max}$ represents the maximum air temperature (°C), $T_{min}$ represents the minimum air temperature (°C), and $k_{Rs}$ represents the empirical adjustment coefficient of Allen et al. (1998).
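The substitutions for missing data can be sketched as below, following the procedures of Allen et al. (1998): dew point temperature approximated by the minimum temperature, and solar radiation derived from the daily temperature range. The function names are hypothetical, and the default coefficient of 0.16 is the value those authors suggest for interior locations; it is an assumption here, not necessarily the value used in this study.

```python
import math

# Default wind speed (m/s) used when no wind measurements are available
DEFAULT_WIND_SPEED = 2.0


def actual_vapor_pressure_from_tmin(t_min):
    """Actual vapor pressure (kPa), assuming dew point is close to t_min."""
    return 0.611 * math.exp(17.27 * t_min / (t_min + 237.3))


def solar_radiation_from_temperature(t_max, t_min, ra, k_rs=0.16):
    """Solar radiation (MJ m-2 day-1) from the daily temperature range.

    ra is the extraterrestrial radiation (MJ m-2 day-1); k_rs is an
    empirical adjustment coefficient (0.16 is an assumed default).
    """
    return k_rs * math.sqrt(t_max - t_min) * ra


if __name__ == "__main__":
    # Hypothetical day: Tmax = 25 C, Tmin = 16 C, Ra = 30 MJ m-2 day-1
    print(actual_vapor_pressure_from_tmin(16.0))
    print(solar_radiation_from_temperature(25.0, 16.0, 30.0))
```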
To improve the performance of the equations studied, a local calibration was performed with a simple linear regression using data from 2002 to 2011 (10 years), as suggested by Allen et al. (1998). To accomplish this, a linear regression was fitted so that the ETo values estimated by the reference method (i.e., PM with full data set) were set as the dependent variable, and those estimated by the equation to be calibrated were set as the independent variable, according to Equation 11. The obtained intercept (a) and slope (b) were used as local calibration parameters.
$$ET_{ocal} = a + b\,ET_o \tag{11}$$
where: $ET_{ocal}$ represents the reference evapotranspiration estimated by the calibrated equation (mm day⁻¹), $a$ and $b$ represent the calibration parameters, and $ET_o$ represents the reference evapotranspiration estimated by the equation to be calibrated (mm day⁻¹).
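The calibration step amounts to an ordinary least-squares fit of the reference ETo on the ETo given by the equation being calibrated. A minimal sketch follows; the function names and the sample values are hypothetical.

```python
def fit_linear_calibration(eto_equation, eto_reference):
    """Least-squares intercept a and slope b of eto_reference = a + b * eto_equation."""
    n = len(eto_equation)
    mean_x = sum(eto_equation) / n
    mean_y = sum(eto_reference) / n
    sxx = sum((x - mean_x) ** 2 for x in eto_equation)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(eto_equation, eto_reference))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def apply_calibration(eto_equation, a, b):
    """Calibrated estimates: ETo_cal = a + b * ETo."""
    return [a + b * x for x in eto_equation]


if __name__ == "__main__":
    # Hypothetical calibration-period values (mm/day)
    eto_eq = [3.1, 4.0, 5.2, 4.4, 3.6]
    eto_pm = [3.6, 4.3, 5.6, 4.9, 4.0]
    a, b = fit_linear_calibration(eto_eq, eto_pm)
    print(a, b)
    print(apply_calibration([4.5], a, b))
```

In the setup described above, the parameters would be fitted on the 2002 to 2011 data and then applied unchanged to the evaluation period.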
The MARS method is a nonparametric multivariate regression technique that can map relations between input and output variables without prior assumptions about their functional form, model nonlinearities and interactions, and automatically choose the variables that are important for the modeling process. In MARS, basis functions are fitted over different intervals of the independent variables. The end points of these intervals are called knots (Mehdizadeh et al., 2017). A MARS model with 2 basis functions can be seen in Figure 1. To evaluate the performance of the MARS models and the equations studied, these were grouped according to the weather data required by each model (Table 3).
The development process of a MARS model involves two steps. In the first step (i.e., forward step), an over-fitted model is produced with a large number of knots. In the second step (i.e., backward step), a pruning technique is applied to remove redundant knots (Kisi, 2015). More details about MARS can be found in Cheng and Cao (2014).
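To make the idea of knots and basis functions concrete, the sketch below evaluates a one-variable MARS-type model built from a mirrored pair of hinge functions sharing one knot. The knot position and the coefficients are hypothetical and chosen only for illustration; they are not taken from the models fitted in this study, and the forward and backward steps themselves are not reproduced here.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function.

    direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    """
    return max(0.0, direction * (x - knot))


def mars_two_basis(x, intercept, knot, coef_right, coef_left):
    """Piecewise-linear model with two hinge functions sharing one knot."""
    return (intercept
            + coef_right * hinge(x, knot, +1)
            + coef_left * hinge(x, knot, -1))


if __name__ == "__main__":
    # Hypothetical model: the slope changes at a knot of 20 (e.g. degrees C)
    for temperature in (10.0, 20.0, 30.0):
        print(temperature,
              mars_two_basis(temperature, intercept=3.0, knot=20.0,
                             coef_right=0.2, coef_left=-0.1))
```

Because each term is an explicit hinge with a known knot and coefficient, the fitted model can be written out and applied as an ordinary algebraic equation.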
MARS models were developed using ETo estimated by the PM equation (with full data set) as benchmark, with data from 2002 to 2011 (10 years). The implementation process was performed using the py-earth library for the Python programming language.

Performance evaluation
The performance of the models was evaluated using data from 2012 to 2016 (5 years). To accomplish this, the statistical indices root mean square error (RMSE), mean absolute error (MAE) and coefficient of determination (R²) were used based on the following equations.
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2}$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right|$$
$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(P_i - \bar{P}\right)\left(O_i - \bar{O}\right)\right]^2}{\sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2\,\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2}$$
where: RMSE is the root mean square error (mm day⁻¹), MAE is the mean absolute error (mm day⁻¹), R² is the coefficient of determination, $P_i$ represents the predicted value (mm day⁻¹), $O_i$ represents the observed value (mm day⁻¹), $\bar{P}$ represents the mean of predicted values (mm day⁻¹), $\bar{O}$ represents the mean of observed values (mm day⁻¹), and $n$ represents the number of data pairs.
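The three indices can be computed directly from paired predicted and observed values. The sketch below assumes R² is taken as the squared correlation between the two series, which is the form suggested by the symbol list above; the function names and sample values are hypothetical.

```python
import math


def rmse(pred, obs):
    """Root mean square error."""
    n = len(pred)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / n)


def mae(pred, obs):
    """Mean absolute error."""
    n = len(pred)
    return sum(abs(p - o) for p, o in zip(pred, obs)) / n


def r_squared(pred, obs):
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov ** 2 / (var_p * var_o)


if __name__ == "__main__":
    # Hypothetical evaluation-period values (mm/day)
    predicted = [3.4, 4.1, 5.0, 4.6]
    observed = [3.6, 4.3, 5.6, 4.9]
    print(rmse(predicted, observed),
          mae(predicted, observed),
          r_squared(predicted, observed))
```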

Temperature-based models
By evaluating the performance of the PMT, HS and OUD methods, it was observed that, in general, the PMT method had the best performance, with lower RMSE and MAE values, while the OUD method had the worst performance (Table 4). These results corroborate Almorox, Senatore, Quej, and Mendicino (2018), who concluded that the PMT equation provides more accurate ETo estimates at a monthly scale than the HS equation in different regions of the world. Similarly, Alencar, Sediyama, and Mantovani (2015) obtained better performances of the PMT equation compared to the HS equation on a daily scale in a study carried out in Brazil. The better performance of the HS equation compared to the OUD equation was also reported by Almorox, Quej, and Martí (2015). After local calibration, better performances were obtained for all methods at all sites. The calibrated PMT and HS methods had performances very similar to each other, surpassing the calibrated OUD method. The fact that the OUD method had the worst performance can be explained by the structure of the equation, which was not able to satisfactorily explain the relationship between the input variables (i.e., air temperature and extraterrestrial radiation) and ETo, as shown by its smaller R² values. It is important to note that calibration by simple linear regression does not change the R² value. According to Liu et al. (2017), an equation can improve its performance with a local calibration; however, when its structure is deficient, optimizing that structure should receive special attention.
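The invariance of R² under linear calibration can be verified in a few lines, assuming R² is computed as the squared correlation between predicted and observed values. Writing the calibrated prediction as $P'_i = a + b\,P_i$ with $b \neq 0$, the deviations from the mean are simply rescaled:

$$P'_i - \bar{P'} = b\,(P_i - \bar{P})$$

$$R'^2 = \frac{\left[\sum b\,(P_i - \bar{P})(O_i - \bar{O})\right]^2}{\sum b^2 (P_i - \bar{P})^2\,\sum (O_i - \bar{O})^2} = \frac{b^2\left[\sum (P_i - \bar{P})(O_i - \bar{O})\right]^2}{b^2 \sum (P_i - \bar{P})^2\,\sum (O_i - \bar{O})^2} = R^2$$

Calibration therefore shifts and rescales the estimates, which can reduce RMSE and MAE, but it cannot strengthen a weak relationship between the output of an equation and the reference ETo.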
The MARS models developed with only measured air temperature data (MARS1) had performances significantly superior to the temperature-based equations in their original forms. Considering the calibrated equations, the MARS1 models continued to have superior performances, but with smaller differences. The largest performance improvements were observed at the Macapá and Palmas stations, where the MARS1 models were able to better correlate air temperature and extraterrestrial radiation with ETo, yielding higher R² values and lower RMSE and MAE values. It is important to emphasize that, even though the MARS1 models had the best performance, their results were considered only reasonable, as they did not attain a high enough performance.
In general, the obtained results show the complexity involved in the ETo modeling process using only measured data of air temperature; even with the best model (MARS1), it was not possible to obtain a high performance. According to Almorox et al. (2015), temperature-based models have low correlations with the PM method (with full data set) in tropical climates, where the role of other climatic variables, such as vapor pressure deficit, can be decisive. Corroborating these authors, it was verified that the highest R² values were obtained at the Curitiba, Lages and Santa Maria stations, which are all located in southern Brazil and belong to climatic class C (temperate climate) (Table 1).

Temperature and solar radiation-based models
As observed in the temperature-based models, the Penman-Monteith equation, when using measured data of temperature and solar radiation (PMR), performed better than the Makkink (MAK) and Jensen-Haise (JH) equations, with lower RMSE and MAE values (Table 5). The PMR equation was only surpassed at the Cabrobó station, where the JH equation performed better, and at the Eirunepé and Lages stations, where the MAK equation performed better. Sentelhas, Gillespie, and Santos (2010) also obtained good results using the PMR equation. On the other hand, the JH equation had the worst performance, corroborating Cunha, Magalhães, and Castro (2013), who reported the unsatisfactory performance of the JH equation in its original form.
After calibration, all of the equations obtained more accurate results; their performances were close, with a slight superiority for the JH equation. It is important to highlight that the results obtained by all of the calibrated equations were quite satisfactory and had a strong agreement with ETo estimated by the PM equation with full data set. Lower performances were obtained only at the Palmas and Cabrobó stations, but these were still considered reasonable. It should be noted that at the Eirunepé station, the MAK and JH models estimated ETo with extremely high precision, obtaining R² values equal to 0.98 and 0.99, respectively. Before calibration, the JH equation had the worst performance; however, due to its better structure, evidenced by its higher R² values, the local calibration was able to make it quite accurate and surpass the others.
The lower performance observed at the Palmas and Cabrobó stations was possibly related to the higher standard deviation values observed for relative humidity (at Palmas station) and wind speed (at Cabrobó station) (Table 2). The greater oscillation of variables that were not used as inputs may lead to a lower performance of the methods, since these variables then have a greater influence on ETo, which makes the modeling process even more complex.
In turn, the MARS model developed with temperature and solar radiation data (MARS2) also had a high performance, better than that of the non-calibrated equations and similar to, though slightly higher than, that of the calibrated equations.

Temperature and relative humidity-based models
By evaluating the performance of the temperature and relative humidity-based models, it was possible to observe that the Romanenko (ROM) equation had the worst performance compared to the Valiantzas (VLT) equation and the Penman-Monteith equation using only measured data of temperature and relative humidity (PMH). The ROM equation, apart from having the highest RMSE and MAE values, also had the lowest R² values (Table 5). The PMH equation performed better than the VLT equation at almost all evaluated sites. In studies performed by Tabari, Grismer and Trajkovic (2013) and Mehdizadeh et al. (2017) in arid and semiarid regions, the ROM equation had a performance lower than the temperature-based methods, which was also seen in this study.
[...] of these methods which, in turn, causes performance variations according to the climatic conditions of the location where the methods are applied (Raziei & Pereira, 2013; Feng et al., 2017).
It is also important to note the role of calibration. In the present study, calibration improved the performance of the equations in all evaluated scenarios. Because these equations are empirical, calibration is a very important factor, as it allows the model to be adjusted to the conditions of the location where it will be used. Several authors have suggested the calibration of empirical equations to obtain better estimations of ETo (Allen et al., 1998; Liu et al., 2017; Shiri, 2017).
In all of the evaluated scenarios, MARS models were able to estimate ETo with the best performance, surpassing conventional equations even after calibration. This behavior reaffirms the power of MARS to model complex problems, such as ETo. Thus, the use of MARS has proven to be a viable alternative for estimation of ETo when available climatic data are limited. Mehdizadeh et al. (2017) also reported the superiority of MARS over conventional equations for estimation of ETo.
To compare MARS models developed in different data availability scenarios, the values of the statistical indices obtained in each scenario are presented in box plots (Figure 2). Based on Figure 2, it was observed that MARS2 had the best performance, followed by MARS3, MARS4, and, finally, MARS1. This behavior indicates the importance of solar radiation for estimation of ETo, since the model that incorporated this variable (MARS2) had the best performance. The addition of relative humidity and wind speed, present in the MARS3 and MARS4 models, respectively, also promoted performance improvements in relation to the model that only used measured temperature data (MARS1), as the mean R² value increased from 0.73 to 0.82 in both cases and RMSE and MAE values decreased. The fact that the MARS3 model performed slightly better than the MARS4 model, which can be noted by its lower mean RMSE and MAE values, indicates that, in general, relative humidity has a greater influence on ETo at the evaluated stations than wind speed.
Corroborating the results obtained in this study, Córdova, Carrillo-Rojas, Crespo, Wilcox, and Célleri (2015) indicated solar radiation as the most important variable for estimation of ETo, followed by relative humidity and wind speed, in that order. Sentelhas et al. (2010) also reported a superior performance of methods using solar radiation as an input variable. In addition, according to Allen et al. (1998), solar radiation represents the largest energy source that promotes water vaporization and, consequently, evapotranspiration.

Conclusion
Local calibration improved the performance of the evaluated equations. The Penman-Monteith equation with missing data represents an alternative to empirical equations, having, in general, equal or superior performance.
MARS models are good options for estimation of ETo under conditions of limited weather data, as they present the best performances across all the assessed data scenarios.
The models that used solar radiation in addition to temperature had the best performance, followed by models that used relative humidity and, finally, wind speed. The temperature-based models showed the worst performance; however, these can be used with reasonable performance, mainly by means of MARS models.

Figure 1. MARS model with two basis functions.

Figure 2. Box plots of RMSE (a), MAE (b), and R² (c) for the MARS models.

Table 1. Location, altitude and climate classification of the chosen weather stations.

Table 2. Daily mean values and standard deviations of meteorological variables for the chosen stations.

Table 3. Methods used in the study and their respective input variables.

Table 4. Statistical indices for the temperature-based models.
MARS1 - multivariate adaptive regression splines with measured data of temperature; PMT - Penman-Monteith with measured data of temperature; HS - Hargreaves-Samani; OUD - Oudin. The expression "cal" indicates the locally calibrated version of a given model.