Image analysis of seeds and machine learning as a tool for distinguishing populations: Applied to an invasive tree species

Invasive species threaten crops and ecosystems worldwide. Therefore, we sought to understand the relationship between the geographic distribution of species populations and the characteristics of their seeds using techniques such as seed image analysis, multivariate analysis, and machine learning. This study aimed to characterize Leucaena leucocephala (Lam.) de Wit. seeds from spatially dispersed populations using digital images and to analyze the implications for genetic studies. Seed size and shape descriptors were obtained by image analysis for five populations. The analyses included descriptive statistics, principal components, Euclidean distance, the Mantel correlation test, and supervised machine learning. Image analysis proved efficient in detecting biometric differences in L. leucocephala seeds from spatially dispersed populations. The method revealed that these populations had different biometric seed patterns that can be used in studies of population genetic divergence. Using the decision tree algorithm, it was possible to identify the origin of the seeds from the biometric characters with 80.4% accuracy (Kappa statistic 0.755). Digital image analysis associated with machine learning is promising for discriminating forest tree populations, supporting management activities, and studying population genetic divergence. This technique contributes to the understanding of genotype-environment interactions and, consequently, helps to identify the ability of an invasive species to spread in a new area, making it possible to track and monitor the flow of seeds between populations and other sites.


Introduction
Leucaena leucocephala (Lam.) de Wit. (Fabaceae) is a tree native to Central America and Mexico. It is currently distributed in most tropical areas of the world (Fonseca & Jacobi, 2011) and occurs in cultivated and invaded areas (Crawford et al., 2015). The occurrence of L. leucocephala is limited to temperatures between 10 and 40°C with annual rainfall between 600 and 1,700 mm. However, this species can also occur in regions with only 250 mm of annual rain and drought for up to eight months, as is the case in semiarid conditions (Drumond & Ribaski, 2010).
Its growth is relatively fast, making it promising for exploration in semiarid regions (Walker, 2012), which justifies the work done with this species in dry forests. In addition, its potential is associated with its ability to regrow, resist drought, and adapt to soil conditions, as well as with its productivity and acceptance by animals as food (Câmara et al., 2015; Crawford et al., 2015; Azuara-Morales et al., 2020; Barros et al., 2020; Dueñas et al., 2020). The energy potential of its wood is similar to that of Eucalyptus grandis W. Hill (ex Maiden) (Machado, Andrade, Silva, Sena, & Thode-Filho, 2014), whose calorific capacity is adequate for charcoal production (Silva et al., 2018).
Invasive L. leucocephala has high potential for expansion in biodiversity hotspots and can advance into vulnerable and critical areas, including natural reserves in tropical and subtropical forests (Wan & Wang, 2018). There is an abundance of semi-domesticated L. leucocephala populations that spread spontaneously in marginal and disturbed areas. These may have different adaptive characteristics, depending on their geographical origins. Exotic species initially encounter adaptive difficulties when colonizing new environments, resulting from reduced genetic variation. However, the introduction of different genetic lineages drives the natural selection of genotypes that become better adapted over time, thus generating effective colonization and consequently increasing their potential for invasion (Wang et al., 2012).
For instance, Ambrosia artemisiifolia L., native to North America, was introduced in the Rhône Valley, France, during the 18th century. Currently, it is considered to be a highly invasive species in this region. According to one study, the introduction occurred from several sources, which explains the high genetic variability of A. artemisiifolia populations in France, which reach even greater intrapopulation genetic variability than populations of North American origin (Genton, Shykoff, & Giraud, 2005).
In this sense, the characterization of populations from different geographical origins is relevant to understanding the adaptive potential of plant species and to applying the results in genetic improvement programs. The study of biometric patterns of seeds from different populations can be an efficient method for this purpose, as seeds are the primary vehicle for the spread of germplasm. Variations in seed biometrics are closely linked to environmental changes and to the genetic diversity of populations, which may be associated with their geographical origin (Brus et al., 2011; Santana, Torres, & Benedito, 2013; Costa et al., 2016; Khumaeva, Khabibov, & Muratchaeva, 2016; Rewicz, Bomanowska, Magda, & Rewicz, 2016; Pontes et al., 2018; Alfaro-Solís et al., 2020; Felix, Medeiros, Ferrari, Vieira, & Pacheco, 2020).
Artificial intelligence and machine learning can be widely used to recognize seeds based on the characteristics of each species (Bao & Bambil, 2021) and genotype (Tan, Wang, Li, & Gong, 2019; Landa et al., 2021; Ropelewska & Piecko, 2022). We hypothesize that the biometric descriptors of seeds among populations are congruent with their geographical distances. Additionally, we sought to understand the diversity of L. leucocephala seed descriptors within its invasive range. This paper aimed to characterize L. leucocephala seeds from spatially dispersed populations through digital imaging and to analyze the implications for genetic studies.

Characterization of populations
We selected five newly established L. leucocephala populations in Rio Grande do Norte State, Brazil (Figure 1). Each feral population (escaped from cultivation) was delimited along local roads. Each sampled population consisted of a group of individuals occupying an area greater than 1.0 ha. The fruits (pods) were manually harvested from five individuals per population, and seed processing consisted of manually removing the seeds from the pods and selecting healthy seeds, discarding those that were defective or damaged by insects.

Image processing
The seeds of each population were subjected to digital image processing to obtain biometric descriptors. Seed biometrics were analyzed using the digital image editing program ImageJ® (https://imagej.nih.gov/ij/). For each population, 800 seeds were photographed (12 MP camera) from 20 cm away on a matte white paper background, with a ruler graduated in millimeters. Image processing consisted of converting the original image (Figure 2A) into an 8-bit format (256 tones) (Figure 2B), followed by calibration of the scale in millimeters, selection of the area to be analyzed, and inversion of the gray spectrum (Figure 2C). We then applied the 'threshold' mask to differentiate between the image components (Figure 2D) and ran the particle analysis, thus obtaining the biometric descriptors (Figure 2E), whose data were exported to a spreadsheet (Microsoft Office Excel®). The descriptors were calculated automatically.
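The processing chain described above (8-bit conversion, gray inversion, thresholding, particle analysis, and scale calibration) can be illustrated with a minimal pure-Python sketch on a tiny synthetic image. This is not the ImageJ implementation: the function names, the simple channel-mean gray conversion, the cutoff of 128, and the 0.5 mm-per-pixel scale are all assumptions chosen for illustration.

```python
# Illustrative sketch only: function names, the cutoff value, and the scale
# are assumptions, not ImageJ settings reported in the study.

def to_gray8(rgb_image):
    """Convert rows of (R, G, B) tuples to 8-bit gray using the channel mean."""
    return [[(r + g + b) // 3 for r, g, b in row] for row in rgb_image]

def invert(gray):
    """Invert the gray scale so dark seeds become bright on a dark background."""
    return [[255 - v for v in row] for row in gray]

def threshold(gray, cutoff):
    """Return a boolean mask: True where the pixel belongs to a particle."""
    return [[v >= cutoff for v in row] for row in gray]

def particle_areas(mask, mm_per_pixel):
    """Label 4-connected foreground regions and return each area in mm^2."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            stack, pixels = [(r0, c0)], 0
            seen[r0][c0] = True
            while stack:  # flood fill one particle
                r, c = stack.pop()
                pixels += 1
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= rr < rows and 0 <= cc < cols
                            and mask[rr][cc] and not seen[rr][cc]):
                        seen[rr][cc] = True
                        stack.append((rr, cc))
            areas.append(pixels * mm_per_pixel ** 2)
    return areas

WHITE, DARK = (255, 255, 255), (60, 40, 20)  # paper background, seed coat
image = [
    [WHITE, WHITE, WHITE, WHITE, WHITE, WHITE],
    [WHITE, DARK,  DARK,  WHITE, WHITE, WHITE],
    [WHITE, DARK,  DARK,  WHITE, DARK,  WHITE],
    [WHITE, WHITE, WHITE, WHITE, DARK,  WHITE],
]
mask = threshold(invert(to_gray8(image)), cutoff=128)
areas = particle_areas(mask, mm_per_pixel=0.5)
print(areas)
```

The scale factor plays the role of the ruler calibration: pixel counts only become areas in mm² once the millimeters-per-pixel ratio is known.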
Acta Scientiarum. Agronomy, v. 46, e62658, 2024
The evaluated biometric descriptors followed the descriptions of Ferreira and Rasband (2012), as follows. Area: seed selection area in calibrated square units, calculated within the polygon defined by the perimeter (mm²); Perimeter: length of the outer boundary of the seed selection in calibrated units, calculated from the centers of the boundary pixels (mm); Circularity: scalar value from 0.0 to 1.0 relating the seed shape to its perimeter, $4\pi \times \text{area} / \text{perimeter}^2$, indicating a perfect circle when close to 1.0 and an elongated shape when close to zero; Length: greatest distance between two points along the seed selection boundary in calibrated units (mm); Width: smallest distance between two points along the seed selection boundary in calibrated units (mm); Aspect ratio: ratio of the major to the minor axis of the ellipse fitted to the seed. The convex area is characterized as a delimiter of the original shape of the object, such as a rubber band tightly wrapped around the area (Matsumoto et al., 2015).
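As a hedged illustration of how the shape descriptors relate to the basic measurements, the sketch below computes circularity, aspect ratio, and solidity as defined in the text and in the Figure 2 caption. The roundness formula, $4 \times \text{area} / (\pi \times \text{major axis}^2)$, is taken from the ImageJ documentation and is an assumption here, since the text does not state it; the function name and the input values are hypothetical.

```python
import math

def shape_descriptors(area, perimeter, major_axis, minor_axis, convex_area):
    """Dimensionless shape descriptors from calibrated measurements (mm, mm^2)."""
    return {
        # 1.0 for a perfect circle, approaching 0 for elongated shapes
        "circularity": 4 * math.pi * area / perimeter ** 2,
        # major / minor axis of the fitted ellipse
        "aspect_ratio": major_axis / minor_axis,
        # ImageJ-style roundness (assumed; not stated in the text)
        "roundness": 4 * area / (math.pi * major_axis ** 2),
        # area relative to the convex hull area
        "solidity": area / convex_area,
    }

# Hypothetical ideal ellipse with axes 8 mm x 5 mm (not measured data).
# The perimeter is omitted from this check, so circularity is computed
# from a placeholder value only to keep the call complete.
ellipse_area = math.pi * (8 / 2) * (5 / 2)
d = shape_descriptors(area=ellipse_area, perimeter=20.7,
                      major_axis=8, minor_axis=5, convex_area=ellipse_area)
print(d["aspect_ratio"], d["roundness"], d["solidity"])
```

For an ideal ellipse, roundness under this definition is the reciprocal of the aspect ratio, which is why the two descriptors move in opposite directions.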

Data analysis
The experimental design was completely randomized, with 800 sample units for each L. leucocephala population. Descriptive statistics were computed, and cluster analysis based on Euclidean distance was used to assess similarity between populations, with a hierarchical dendrogram built by the unweighted pair group method with arithmetic mean (UPGMA) from the standardized group means of the biometric descriptors. The relationship between spatial distance and population dissimilarity was analyzed using the Mantel correlation test ($r_M$) between Euclidean distance and geographic distance, with 10,000 simulations (p < 0.05). The statistical program used was BioEstat® (version 5.3) (https://www.mamiraua.org.br/downloads/programas/).
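The Mantel test described above can be sketched in a few lines of pure Python, as a rough illustration rather than a reproduction of the BioEstat procedure. The sketch builds two Euclidean distance matrices, correlates their upper triangles, and estimates a one-sided p-value by permuting the population labels of one matrix. The function names, the one-sided test, and all input values are assumptions; the trait means and site coordinates are hypothetical, not the study's data.

```python
import math
import random

def euclidean_matrix(points):
    """Pairwise Euclidean distances between rows of coordinates."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def upper(matrix, order=None):
    """Upper-triangle entries, optionally after relabeling rows and columns."""
    n = len(matrix)
    order = order or list(range(n))
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(a, b, permutations=10_000, seed=0):
    """Return (r_M, one-sided p) by permuting the labels of matrix a."""
    rng = random.Random(seed)
    observed = pearson(upper(a), upper(b))
    hits = 0
    for _ in range(permutations):
        order = list(range(len(a)))
        rng.shuffle(order)
        if pearson(upper(a, order), upper(b)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical standardized trait means and site coordinates (km) for
# five populations; these are NOT the values measured in the study.
trait_means = [(1.2, 0.4), (-1.5, -0.3), (0.6, -0.9), (0.8, 0.5), (-1.1, 0.3)]
site_xy_km = [(0, 0), (120, 40), (30, 10), (15, 5), (110, 60)]
r_m, p_value = mantel(euclidean_matrix(trait_means),
                      euclidean_matrix(site_xy_km), permutations=999)
print(f"r_M = {r_m:.3f}, p = {p_value:.3f}")
```

With only five populations there are 5! = 120 distinct relabelings, so the permutation p-value cannot fall much below 1/120 regardless of how many simulations are run.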
Principal component analysis (PCA) of the biometric descriptors was performed with standardized data to obtain the eigenvalues and the variance contributed by each principal component, eliminating the components that explained little of the variation. The statistical program used was Past® (version 3.20) (https://palaeo-electronica.org/2001_1/past/issue1_01.htm). In addition, a supervised machine-learning algorithm based on decision trees (J48 -C 0.25 -M 2, that is, a pruning confidence factor of 0.25 and a minimum of two instances per leaf) was tested. The program used was Weka® (version 3.8) (https://www.cs.waikato.ac.nz/ml/weka/).

Results
The seeds of L. leucocephala had different physical characteristics between populations, as evidenced by the biometric descriptors of seed shape and size quantified by digital image processing (Table 1). Seeds from population POP1 were larger in area (33.85 ± 4.03 mm²), perimeter (22.65 ± 1.47 mm), length (8.34 ± 0.59 mm), and width (5.42 ± 0.34 mm). In comparison, these characteristics were smallest in seeds from POP2 (area: 22.07 ± 2.27 mm², perimeter: 18.17 ± 1.03 mm, length: 6.70 ± 0.44 mm, and width: 4.47 ± 0.24 mm). The seeds of the other populations occupied an intermediate position between POP1 and POP2, indicating physical differences between populations in the seeds of this species. For the aspect ratio, that is, the ratio between the largest and smallest dimensions of the particle evaluated in the image, the seeds of POP3 had the highest value (1.715 ± 0.13) and those of POP5 the lowest (1.450 ± 0.13) (Table 1). The seeds of POP5 showed greater circularity (0.847 ± 0.02) and roundness (0.695 ± 0.06), whereas those of POP3 were less circular (0.799 ± 0.02) and less round (0.587 ± 0.05), consistent with their higher length-to-width ratio. The solidity of the seeds of POP1 and POP4 had the highest index (0.972 ± 0.01), whereas those of POP2 had the lowest index (0.968 ± 0.01). The high solidity values indicate that the seed shape has low irregularity on its outer edges.
The first two principal components were selected because they represented 90.6% of the data variation [Components 1 (60.9%) and 2 (29.7%)] (Figure 3). The seed width, length, perimeter, area, and solidity were highly correlated in the second quadrant (Q2), whereas the shape variables roundness and circularity were highly correlated in the first quadrant (Q1), except for aspect ratio (Q3). Therefore, measurements of the area, perimeter, length, or width were fundamental in capturing most of the existing variation in seed size, whereas circularity and aspect ratio or roundness were important for the seed shape. The seeds of the different populations were widely spread, but it was possible to verify the concentration of each population (different colors). The clustering of populations according to the Euclidean distance distinguished the formation of two groups (group I: POP1, POP3, and POP4; group II: POP2 and POP5) at a cut-off of 70% (Figure 4). Euclidean distance analysis showed that populations POP2 and POP5 were more similar to each other (1.552), while POP1 and POP2 were more divergent (5.788) according to the biometric descriptors evaluated (Table 2). A positive and significant Mantel correlation ($r_M$ = 0.613, p < 0.05) was observed between the biometric descriptors of L. leucocephala seeds and the geographic distance of the studied populations. Additionally, we observed that it was possible to identify the seed origin of L. leucocephala populations from the analysis of the biometric traits with 80.4% accuracy (Kappa statistic 0.755) when we applied the decision tree algorithm. The accuracy of correct seed identification ranged from 75.4 to 84.3% (Table 3).
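To make the relationship between accuracy and the Kappa statistic explicit, the following sketch computes both from a confusion matrix using the standard Cohen's kappa formula, $\kappa = (p_o - p_e) / (1 - p_e)$. The function name and the small confusion matrix are hypothetical. The final lines check the reported values under the assumption that all five populations contributed equally (800 seeds each) to the evaluation, in which case the chance agreement $p_e$ is 1/5.

```python
def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix
    (rows: true class, columns: predicted class)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1 - expected)

# Hypothetical two-class example (not data from the study)
acc, kappa = accuracy_and_kappa([[8, 2],
                                 [2, 8]])
print(acc, kappa)

# Consistency check of the reported figures, assuming five equally sized
# classes so that the chance agreement is 1/5
reported_accuracy = 0.804
chance = 1 / 5
kappa_from_accuracy = (reported_accuracy - chance) / (1 - chance)
print(round(kappa_from_accuracy, 3))
```

Under that balanced-class assumption, an accuracy of 80.4% corresponds to a kappa of about 0.755, which agrees with the value reported in the text.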

Discussion
This study evaluated the seed traits of spatially dispersed populations using digital images of L. leucocephala and analyzed their implications for genetic studies. We found that distant populations differed in their seed traits. Diversity in the shape and size of seeds is related to the genetic characteristics of the populations and the habitats in which they were found. The phenotypic diversity of L. leucocephala seeds can be attributed to the pantropical distribution of the species (Fonseca & Jacobi, 2011; Crawford et al., 2015; Wan & Wang, 2018), which favors rapid establishment in new environments.
Related studies have demonstrated an association between the biometric characteristics of fruits and seeds and genotype distinctions, as observed in this study. For instance, different populations of Myrtus communis L. in Palermo, Italy, were evaluated for genetic diversity and morphological characteristics of fruits and seeds. Two of the main groups were significantly correlated with biometric characteristics, which can be used as morphological markers for fruit production in selection and breeding programs (Melito et al., 2016). In a study of the morphological diversity of Argania spinosa Skeels, it was possible to differentiate genotypes based on the characteristics of the fruits for in situ preservation in the Admine Reserve, Morocco (Metougui, Mokhtari, Maughan, Jellen, & Benlhabib, 2017).
The establishment of exotic species in new areas increases population distribution and contributes to the genetic divergence among individuals (Wan & Wang, 2018), as reported in the Rhône Valley, France. In that study, populations of A. artemisiifolia showed high intrapopulation genetic diversity in the original introduction area, with a reduction across the range of natural expansion/colonization of the species resulting from sequential bottlenecks, which was expected at this stage of expansion when superior genotypes were selected (Genton et al., 2005).
Biometrics of forest seeds have been reported with sample numbers equal to or greater than 100 seeds (Santana et al., 2013; Costa et al., 2016; Dutra et al., 2017; Menegatti et al., 2017; Roveri-Neto & Paula, 2017; Silva et al., 2017; Noronha et al., 2018; Correia et al., 2019; Zuffo et al., 2019; Rosa et al., 2020). This is because of the difficulty in analyzing seeds using traditional methods; therefore, the image processing technique is a suitable alternative for the analysis of many samples. In addition, the biometric analysis of seeds using images provides further information related to the size and shape of seeds, whereas the manual method using digital calipers is limited to length, width, and thickness (Felix et al., 2020). Thus, digital image processing allows for the exploration of several seed descriptors, provided there is a reference scale in the captured images.
The results obtained through seed imaging and multivariate analysis showed that seeds provide important information about the genetic divergence between L. leucocephala populations. Multivariate analyses, through simplified functions, reveal trends and variations in the data that allow the divergence between genotypes to be estimated (Costa et al., 2016). The first two principal components were adequate because they represent 90.6% of the data variation, meeting the criteria of having an eigenvalue greater than 1.0 (Kaiser, 1960) and concentrating at least 70% of the total variation between variables (Medina et al., 2010).
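The two retention criteria cited above can be checked with a short worked example. The sketch assumes that the PCA was run on the correlation matrix of eight standardized descriptors (area, perimeter, length, width, circularity, aspect ratio, roundness, and solidity), in which case the eigenvalues sum to the number of variables and each eigenvalue equals the explained proportion times eight. The eigenvalues themselves are not reported in the text, so the values derived here are estimates under that assumption; the function name is hypothetical.

```python
def retained_components(explained, n_variables,
                        min_eigenvalue=1.0, min_cumulative=0.70):
    """Apply the eigenvalue > 1 rule and the cumulative-variance rule.

    explained: proportion of variance per component (correlation-matrix PCA,
    so each eigenvalue is assumed to be proportion * n_variables).
    """
    eigenvalues = [p * n_variables for p in explained]
    kept = [i for i, ev in enumerate(eigenvalues) if ev > min_eigenvalue]
    cumulative = sum(explained[i] for i in kept)
    return eigenvalues, kept, cumulative >= min_cumulative

# Explained variance of PC1 and PC2 as reported in the Results;
# n_variables=8 is an assumption based on the descriptors listed.
eigenvalues, kept, enough = retained_components([0.609, 0.297], n_variables=8)
print([round(ev, 3) for ev in eigenvalues], kept, enough)
```

Under these assumptions both components would have eigenvalues well above 1.0, and together they exceed the 70% threshold.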
Clustering methods were efficient in verifying the existence of genetic variability and similarity between genotypes, and principal component analysis reduced the number of variables, facilitating the interpretation of results. In other studies, seed image analysis combined with multivariate analysis provided important information about the adaptation and evolution of Anchusa L. taxa in Sardinia (Farris et al., 2020), helped to identify seeds from different locations (Pan et al., 2021; Cecco, Musciano, D'Archivio, Frattaroli, & Martino, 2019), and described the distribution of species from Central Europe (Mazur, Marcysiak, Dunajska, Gawlak, & Kałuski, 2022). In the present study, seeds from closer populations also presented similar biometric characteristics.
A significant Mantel correlation suggests an isolation-by-distance pattern. In other words, the greater the geographical distance, the more divergent the seeds were with respect to the evaluated parameters. L. leucocephala does not reach a large size, is considered a pioneer heliophyte species with autochoric and zoochoric dispersal, and can grow under stress conditions that limit its expansion in nondegraded or disturbed areas (Costa & Durigan, 2010). Therefore, there were divergences between the assessed populations based on the biometric descriptors of the seeds, and geographic distribution influenced the variability between populations of L. leucocephala. Our results suggest that the analysis of seed images can also be tested in other forest species, aiming to estimate the degree of genetic variability between native, exotic, or invasive populations.
The results of machine learning demonstrated the effectiveness of recognizing seeds from different populations using biometric size and shape descriptors. The use of machine learning automates the recognition of seeds from distinct geographic origins, which was demonstrated here for L. leucocephala. In view of this, new studies at the continental level or between biogeoclimatic zones will contribute to the understanding of the expansion of invasive species that cause agricultural production losses and invade forest ecosystems.

Conclusion
Image analysis was efficient in detecting biometric differences in L. leucocephala seeds from distinct locations. Therefore, this method, combined with machine learning, is promising for discriminating forest tree populations, supporting management activities, and studying population genetic divergence. Additionally, digital image analysis contributes to understanding genotype-environment interactions and, consequently, to identifying the ability of an invasive species to spread in a new area, making it possible to track and monitor the flow of seeds between populations and other sites. In the future, the analysis of seed images can also be tested in other forest species to estimate genetic variability between native, exotic, or invasive populations.

Figure 2. Steps in digital image processing to obtain biometric descriptors in Leucaena leucocephala seeds. A: photo capture; B: conversion to 8-bit format; C: inversion of the gray spectrum; D: use of the threshold mask; E: measurements of the seed perimeter, area, solidity (area/convex area), length, width, aspect ratio (major axis/minor axis), circularity ($4\pi \times \text{area} / \text{perimeter}^2$), and roundness.

Figure 3. Principal component analysis (PC1 x PC2) based on the digital biometric analysis of Leucaena leucocephala seeds.

Figure 4. Cluster analysis based on Euclidean distance for five populations of Leucaena leucocephala and their biometric descriptors (formation of groups at 70% cut-off level).

Table 1. Biometrics of seeds from different populations of Leucaena leucocephala obtained by digital image analysis (values are given as mean ± standard deviation).

Table 2. Euclidean distance and geographic distance between five Leucaena leucocephala populations.