Soil surface spectral data from Landsat imagery for soil class discrimination

The aim of this study was to develop and test a method to determine and discriminate soil classes in the state of São Paulo, Brazil, based on spectral data obtained via Landsat satellite imagery. Satellite reflectance images were extracted from 185 spectral reading points, and discriminant equations were obtained to establish each soil class within the studied area. Sixteen soil classes were analyzed, and discriminant equations that comprised TM5/Landsat sensor bands 1, 2, 3, 4, 5, and 7 were established. The results showed that this methodology could effectively identify individual soil classes using discriminant analyses of the spectral data obtained from the surface. Success rates of > 40% were achieved for 14 of the 16 evaluated soil classes when applying the satellite image data. When the 10 soil classes containing the largest number of minimum cartographic areas were used, the hit rate increased to > 50%, for seven soil classes with a global hit rate of 52%. When the soil classes were grouped based on their parent materials, the hit rate increased to 70%. Thus, we concluded that the spectral method for soil classification was efficient.


Introduction
In recent decades, Brazil has become a prominent worldwide leader in agricultural exports due to the expansion into new agricultural frontiers, implementation of technological improvements, and exceptional efforts to address adversities among growers.To continue this growth, Brazil must adopt suitable practices for using and managing previously uncultivated land areas.
Soil surveys that include appropriate and detailed cartographic representations allow us to obtain a wealth of information that, when properly managed, can ena-ble technicians and growers to employ methodologies and establish strategies that extend and even increase the productive capacity of their lands.However, there are an insufficient number of professionals available to map such a large country, and there are still millions of hectares of land left for the rational and sustainable production of food (DALMOLIN, 1999).Demattê (2001) stated that the study and implementation all available technologies regarding soil spectral analyses is of critical significance.Spectral behavior analyses of soils using remote sensors might provide answers to the problem of class discrimination.Such studies are based on the fact that each soil presents a spectral signature according to the absorption of specific wavelengths across the entire electromagnetic spectrum (BEN-DOR et al., 1999).
The fundamental theory of soil spectral analyses indicates that soils can be characterized according to their B horizon.This horizon is not detected using satellite data unless erosion has occurred.Therefore, how could spectral responses help in discriminating soil units?To answer this question, we begin with the assumption that many soil classes exhibit surface characteristics that differ from other classes and therefore can be used as a taxonomic identifier.This assumption has been combined with the use of aerial photographs for soil discrimination.This method does not detect B horizon values; however, it does infer information regarding the 'probable' soil class.A small number of published studies of soils and tropical soils have suggested that the use of satellite images can assist in the discrimination of surface information.
Because a spectral response is also an individual trait, the use of this technique allows for the separation of soil classes and may therefore aid in pedological surveys.Because soil properties are related to soil classes, and these soil properties are related to spectra, we can correlate spectra with soils to facilitate the detection of soil types.Therefore, the aim of this study was to develop and test a method for determining and distinguishing different soil classes in the southwestern region of São Paulo State, Brazil, based on spectral data obtained from satellite images.

Material and methods
The studied area is located in the southwestern region of the state of São Paulo and is delimited by the geographical coordinates 23º0 '31.37" -22º58'53.97" south latitude and 53º39'47.81" -53º37'25.65"west longitude in a region known as the Paleozoic depression (IPT, 1981).As part of the hydrographic basin of the Tietê River (OLIVEIRA et al., 1992), the area is bordered by the Capivari river and spans an area of approximately 198 ha with a perimeter of 11,045.80m (Figure 1).
Geologically, the studied area is situated in the Itararé Formation and is part of the Tubarão Group.These units were formed between the Upper Carboniferous Period and Middle Permian Epoch and occur in the state a complex association of several lithofacies, almost all of which are detritic that formed in rapid vertical and horizontal successions (IPT, 1981).The predominant lithologies consist of mineralogically immature sandstones with heterogeneous granulation ranging from feldspathic sandstones to arkoses; these units range from thin layers to banks that are tens of meters thick (VIDAL-TORRADO; LEPSCH, 1999).The geographic information system "Sistema de Processamento de Informações Georreferenciadas (SPRING)" (INPE, 1999) was used to implement the cartographic study.This system has multiple functions and algorithms for processing georeferenced databases.A database was established using this system to incorporate the geospatialized information that was obtained from planimetric and altimetric charts and data that was collected in the field using GPS receivers and orbital images.
In the entire area, 185 augured points were selected (Figure 1).These points were marked in a grid fashion with one point per hectare (WOLKOWSKI; WOLLENHAUPT, 1994) following procedures used in soil surveys.Specifically, a synthetic approach was used (EMBRAPA, 1996) and the database was georeferenced with a differential global positioning system (DGPS) technique, which is a method used with GPS receivers to achieve higher precision with these data receptors in absolute positioning modes.
The collected samples were first dried in a forced ventilation oven at a constant temperature of 50ºC for 48h and then sifted through a 2-mm sieve (airdried soil).The textural soil groupings were conducted according to Embrapa (1999).The densitometric method was used to determine the total sand, silt and clay contents (CAMARGO et al., 1986).Organic matter (OM), active and residual acidity, pH and cation exchange capacity (CEC) determinations were performed according to Embrapa (1997).The sum of the cations (i.e., calcium, magnesium and potassium) (SC), base saturation (V%= [SC/CEC]*100), and aluminum saturation (m%= [Al 3+ /SC+Al 3+ ]*100) were determined according to Van Raij and Quaggio (1989).Total iron (Fe), silicon (Si), and titanium (Ti) contents were determined by sulfuric acid digestion according to a methodology recommended by Embrapa (1997).The soil color was identified using the Munsell color chart (GRETAGMACBETH, 2000).
The physiographic unit boundaries were established after generating orbital image visual interpretations according to the recommendations of Donzeli et al. (1983) and Nanni and Rocha (1997).These boundaries were then used in the SPRING system after generating successive combinations of algorithms and filters to improve the visual quality as reported by Nanni and Rocha (1997).The soil classes were defined after analyzing the laboratory samples and open trenches at representative physiographic unit locations.The fieldwork procedures and the description and collection of the materials were performed according to the criteria established by Lemos and Santos (1996).The soil classes were defined according to the Brazilian classification system (EMBRAPA, 1999) and correlated to the soil taxonomy (COSTA; NANNI, 2004).

Spectral data acquisition and processing
The information extracted from the pixels was used to extract the gray value levels for each TM-Landsat-5 image point that corresponded to a sampling location in the field.The gray value levels for each band were then converted into reflectance values according to the procedures established by Markham and Barker (1986) and Thome et al. (1997).For Rayleigh scattering and ozone absorption corrections, the 5S irradiative transfer code simulation (TANRÉ et al., 1992;VERMOTE et al., 1997) was used and properly corrected for atmospheric effects (KAHLE et al., 1980).After the conversion and correction procedures, the zero gray level images corresponded to 0% reflectance, whereas a gray level of 255 corresponded to 100% soil reflectance (DEMATTÊ;NANNI, 2003).
To verify whether the points in the image that corresponded to the field-collected points characterized the exposed soils (i.e., without plant cover), the normalized difference vegetation index (NDVI) methodology was applied (DEMATTÊ et al., 2000;JACKSON, 1983;KAUTH;THOMAS, 1976).

Statistical analyses
The Statistical Analysis System (SAS, 1992) software package version 6 was used to manage the obtained data and perform the statistical analyses.To organize the database for statistical analyses and the evaluation of hypotheses, the reflectance values for each soil were fitted to a data matrix used for this analysis, which consisted of the six TM-Landsat-5 bands for each sampled terrain.
Completely randomized variance analyses were performed on this matrix, and the reflectance means of each soil per selected wavelength range were analyzed using the Tukey test at the p ≤ 0.01 and p ≤ 0.05 levels.The general linear models (GLM) method of the SAS program was used to test each selected wavelength range for the hypothesis where H o : soil1=soil2=soil3=soil4=soil5.The rejection of this null hypothesis implies the acceptance of an alternative hypothesis where H 1 : at least two soils are statistically different.The purpose of this study was to determine which variables would have a higher or lower potential to yield models that best explain the problem.Therefore, the STEPDISC method (SAS, 1992) was initially employed to select suitable bands.Discriminant analyses were then conducted to differentiate and characterize the soil with the goal of developing and testing the method for soil class determination based on either its spectral data or analytical characteristics.Thus, the DISCRIM method (SAS, 1992) was used.The parametric method was used to develop the discriminant linear functions.

Results and discussion
The area studied included a variety of parent materials consisting of reworked sandstonediabase, sandstone-shale, diabase-shale materials, and a combination of all three.As a result of these geological conditions, 18 soil classes were established (Table 1).
The following soil subgroups and land areas dominated the study area: Paleudults and Paleudalfs, 58.12 ha (31.12%);Oxysols, 39.60 ha (21.21%);Dystrochrepts and Eutrochrepts, 36.92 ha (19.76%);Rhodic Paleudalfs, 19.08 ha (10.22%);Argiudolls, 12.76 ha (6.84%); Udorthent, 9.76 ha (10.22%); and Udifluvents, 10.52 ha (5.63%).The declivity conditions preestablished the possible presence of thin soils with depths ranging between 1.50 and 2.00 meters.Deeper soils were found in plains or gently rolling areas and reached up to tens of meters in depth.Because the cartographic land areas included in the strongly rolling and mountainous classes (with soils less than 0.80 cm in depth) were small, lower proportions of shallow soil classes were observed.Rocky outcrops were also included in the shallow soil class.
As expected, the soil classes that contained a small number of samples showed smaller amplitudes or variability of the attributes allowed by the class.The B1, B2, B3, B4, B5, and B7 values corresponded to the bands from the TM sensor (Table 1).This data bank was structured according to the following guidelines: (1) the soil spectral patterns from a given area are used to develop statistics and equations; (2) when a soil sample cannot be classified, its spectral information must be acquired; (3) the spectral information must be applied to all previous equations (Table 1); and (4) the highest obtained data on equations results will be the desired soil.However, such reasoning is valid only for the soils and conditions that were observed in this study.
Furthermore, this reasoning created the potential to characterize different regions in the country, thus allowing for faster preliminary identifications, as stated by Coleman and Montgomery (1990), Demattê andGarcia (1999), andNanni et al. (2004).
According Gerbermann and Neher (1979), if a dataset is obtained via an automated technique, soil maps will be produced faster than with conventional methods.Table 2 summarizes the hit percentages for the classification of the soils that were examined using the discriminant equations compared to the conventional classification methodology.Of the 18 analyzed classes (Table 3), only five showed results hit values of > 60% within the class.Among these classes, the best results were achieved for CE1 and NVef2 with a 75% hit rate within each class.However, most of the classes showed inadequate results.Total failure was observed (i.e., no observations were auto classified using the established discriminant equations) for the soil classes CD2, PVAd and PVAe2 with six, two, and five individual sites, respectively.
In contrast to the total failure observed, the following aspects should be considered: (1) a misidentification occurred for those classes with similar physical surface characteristics, such as organic matter content, iron, clay, sand, and color, which would be difficult to distinguish even in the field; (2) classes with higher iron contents due to their parent material (diabase) showed infrequent or no misidentifications for confusion with the classes derived from sandstone; (3) the Dystrochrepts soil, which is a transitional class, showed the highest number of classification errors; and (4) in some classes with low numbers of sites, an error in a single point could result in a high rate of error for the class, which would contribute to an increase in the global error rate.For example, the Rle2 class (i.e., litholic soil with a shale substrate) with only two sites was problematic.Because one of the sites was classified as a Eutrochrepts soil with a shale substrate (CE2), the error rate was 50%.Considering that these soils were similar at the surface and in satellite images, the reported error (50%) was high and affected the global error for the entire analysis.This class showed the highest global median error value among all analyses performed for this study (61.36%).
In an attempt to eliminate this problem, only the 10 soil classes with the highest number of sites were evaluated.The classes that showed more than seven individual sites were selected following the procedure reported by Oliveira et al. (1982).To ensure that the statistical analyses could not exhibit a marked influence on the standard error of the average, a minimum of 7 samples per soil class were used.
Of the 18 identified classes in the study area, only 10 contained more than 7 observations, which were classified as the following: RUd, MTfr, CD1, CE2, CD2, LVe, LVAe, PVAe1, PVAe2, and NVef1.The obtained equations for the most abundant classes are shown in Table 3.
Table 4 summarizes the percentages of hits for the classification of the soils that were examined using the discriminant equations.
By limiting the number of soil classes, the global error was reduced to 48.25%.This reduction occurred for two primary reasons.The first reason was the elimination of classes with a small number of sites, which contributed to the increased classification error and the low hit percentages within each class.The second reason for the error reduction was the inclusion of those classes that were previously classified into one of the excluded classes.For example, the CD2 class had a hit rate within its own class of 0% based on an analysis that involved all classes.The hit rate increased to 33.33% when the analysis involved only the 10 classes with the highest number of sites.Furthermore, the classes with similar surface characteristics caused misidentifications during the classification.Therefore, because the orbital data only referred to the surface soil layers, further discriminant analyses were conducted to group the soils according to their parent materials.The classes that were related to the parent material were designated as sand, shale, shale/diabase, and diabase.
The following soil classes were combined to compose the sand group: RUd, CE1, CD1, CD2, LVAd, PVAe1, PVAd, PVAe2, and PVAe3.The CE2 and Rle2 classes formed the shale group.The shale/diabase group contained the MTfr class, and the diabase group consisted of the CE3, LVe, LVAe, Rle1, NVef1, and NVef2 soil classes.The discriminant equations for these groupings are shown in Table 5.
Among the six bands that were evaluated using STEPDISC from the SAS system (Table 5), only the TM_B2 band was not used to assemble the discriminant equations.All of the other bands showed statistical significance at p ≤ 0.01.After the discriminant equations were established, the classification data were generated as shown in Table 6.A significant improvement was observed when the soil classes were grouped according to their parental material (Table 7).The global error for the Chi-square test was 25.51%, which was well below the 48.25% error for the classification using the 10 most populous classes and far better than the 61% error when all classes were used separately.Differences between the parent materials were found using the spectral data that were obtained by the TM Landsat sensor in the soil line analyses (HUETE, 1989) as described by Nanni and Demattê (2006).Nanni and Demattê (2006) found higher variations in soil lines for the soils classes that were developed in areas with sandstone and low iron levels relative to those that contained high amounts of clay and iron.
This fact observed by Nanni and Demattê ( 2006) opens new perspectives and possibilities for geological surveys because this method can set realistic boundaries between the different parent materials.This method confirms that the different parent materials can be discriminated according to their reflectance due to physical interactions with electromagnetic energy.To reinforce the discriminant analyses, a simulation was performed in which 80% of the sampled sites were used to generate a discriminant model that could be tested using the remaining 20% of the data.The sampled sites were randomly selected; i.e., the SAS system randomly selected the components that would take part in the discriminant analyses (80%) and those that would be used to test the obtained models (20%).The procedure was tested using 50 simulations with the system set to randomly choose 80% of the sampled sites, which generated the discrimination model that was used to test the remaining 20% of the sites.Only the 10 soil classes with the highest number of observations were used.Furthermore, classes with a reduced number of sites might not show the variability that exists within the class or might have characteristics that resemble another class.These issues could interfere with the safe and reliable analyses.In addition, as previously described, classes with a small number of sites greatly increased the global error.
After the simulation, the system yielded results that showed the data classification percentage within the model (Table 7).
The hit rate was low when 80% of the observations were applied to the model.Of the 5825 matches that were generated by the 50 simulations for the model-examined data, the model produced 3102 misclassifications with a global error of 53.3%, whereas the classifications matched in 2723 instances with an accuracy of 46.7%.Furthermore, 11% of the total data was lost (632 times).Hence, a difference between the numbers presented in Table 7 (5193) and the total number (5825) was observed.Despite this difference, the presented model was highly significant according to the Chi-square test (p ≤ 0.01).The correlation between the classes that were observed and estimated by the model showed an r-square value of 83.8% as defined by the contingency coefficient.Most of the results indicated that as a first analysis, this classification system could be important for assisting in the next step of soil classification.
Discrepancies between the classifications and field results were due to several parameters, such as the relief and soil surface property, which are not related to the undersurface.In fact, the subsurface is more significant when classifying soils.This horizon contains less organic matter and influences soil management.In addition, the mineralogy is generally more accurate in the subsurface due to less weathering.For many reasons, the surface information provides important indications to assist the convergence of evidence and ultimately aid in soil classification.In the field, the surface of each soil type usually exhibits different appearances due to the dynamics of water, organic matter, relief, and humidity.All of these factors interfere with the spectral information and are responsible for the observed differences.
Therefore, we believe that the emergence of new sensors installed on orbital or sub-orbital platforms with a greater number of bands or spectral ranges will be very important for discriminating soils with a noticeable reduction of classification errors.

Conclusion
The surface information collected assisted in the discrimination of soils in the studied area using statistical analyses and the data obtained at the orbital level.Based on the discriminant analyses, the hit rate increased as the sample number decreased.The simulated statistical test was efficient for establishing errors and matches during the discriminant analyses.
Satellite-derived surface data can provide valuable and relevant information for soil class discrimination and assist in soil surveys.In addition, soil classes that were grouped according to their parent material were identified via discriminant analyses.

Table 1 .
Discriminant equations obtained using the TM-Landsat evaluation bands for all soil classes.

Table 2 .
Discriminant analyses, the number of samples, and the percent of soil classifications for all soil classes in the studied area.

Table 3 .
Discriminant equations obtained using TM-Landsat-evaluated bands for the 10 soil classes with the highest number of observations in the studied area.

Table 4 .
Discriminant analyses, the number of samples, and soil classification percentages for the 10 soil classes with the highest number of observations.

Table 5 .
Discriminant equations obtained using TM-Landsat-evaluated bands for the parental material groups in the studied area.

Table 6 .
The number of observations and percent of soils classified using TM-Landsat bands for the parent material groups.

Table 7 .
The number of observations and soil percentages classified using the 10 soil classes with the highest number of observations in the studied area and 80 % of the observations to develop the model.