GENES - a software package for analysis in experimental statistics and quantitative genetics

GENES is a software package used for data analysis and processing with different biometric models and is essential in genetic studies applied to plant and animal breeding. It allows parameter estimation to analyze biological phenomena and is fundamental for the decision-making process and predictions of success and viability of selection strategies. The program can be downloaded from the Internet (http://www.ufv.br/dbg/genes/genes.htm or http://www.ufv.br/dbg/biodata.htm) and is available in Portuguese, English and Spanish. Specific literature (http://www.livraria.ufv.br/) and a set of sample files are also provided, making GENES easy to use. The software is integrated with the programs MS Word, MS Excel and Paint, ensuring simplicity and effectiveness in data import and in the export of results, figures and data. It is also compatible with the free software R and with Matlab, through the supply of useful scripts for complementary analyses in different areas, including genome-wide selection, prediction of breeding values and use of neural networks in genetic improvement.


Introduction
To breed genetically superior plants, the selected individuals must simultaneously unite a series of properties to produce a comparatively higher yield and to meet consumer demands. A way to increase the chances of success of a breeding program is to perform reliable experiments, generating a great volume of experimental data. Based on an adequate processing of these data, genetic parameters can be estimated and biological phenomena interpreted. In this phase of result analysis and interpretation, appropriate software systems and computer resources are of utmost importance.
The development of software in the field of plant genetics and breeding is crucial due to the scarcity of such resources available to the scientific community. The availability of such tools would meet the increasing demand of users in numerous research institutions who deal with an enormous volume of data and require adequate ways of processing to accurately estimate statistical and biological parameters.
Particularly in the case of plant genetics, the intensive breeding of many species and the complexity of the most important traits require the use of increasingly accurate selection criteria. In all breeding stages, breeders must use information that is expressed in parameters of the biometric models, which are usually available in the output of most scientifically oriented software systems.

Description
The software GENES is compatible with IBM PCs and requires the Windows operating system. Some configuration settings are indispensable, such as a screen resolution of 1024 x 768 (large fonts) and the use of a point as the decimal symbol. The package comprises 257 executable projects and 131 text documents in rtf format, occupies about 285 MB, and is available in English and Portuguese.

Application of the program
An application of the program Genes usually includes the following steps:
a. Examples of data files: Example data files to be processed by Genes are available. They are particularly useful in initial studies because they teach both the operation of the application itself and the statistical and biometrical techniques used. Each procedure is represented by an icon that opens a file containing an illustrative example of that procedure, with a complete description of all its parameters, ready for immediate data analysis.
b. Supplying data for processing: The procedures generally have a common sequence of data analysis. Basically, the user provides the name of the file containing the data to be processed, information about the parameters (number of variables, treatments, blocks etc.), the names of the variables (optional), and then prints or saves the results. It is recommended that these data files be in .txt format, but they can also be imported from Excel spreadsheets.
The data are supplied in a file containing a data spreadsheet, in which each column represents a characteristic to be analyzed and each row an experimental observation. Sometimes, the first columns are reserved for classificatory variables or descriptor effects, e.g., treatments, blocks, years, locations etc.
c. Parameter description: For each procedure, the user must provide specific information on the data file that will be used in processing. Thus, for example, to perform an analysis of variance, the user should provide the number of variables to be analyzed, the number of treatments and the number of blocks or replications. In other procedures, other information will be requested, but the control buttons on the right side of the screen are common to all procedures. These buttons are:
Return: ends operations on the parameter identification screen.
Read Data: reads the file data, considering all rows and columns. This option is useful to identify gaps in the structure of the data file. Through statistics indicating the average, maximum and minimum of each variable, possible typos can be detected. An error in data reading leads to errors in data processing, so the user must apply the necessary corrections, according to the specification of each procedure, to ensure correct data reading and analyses.
d. Definition of names of variables: After providing the information in the procedure of parameter identification, the user can name the variables analyzed. If the variables are not named, the program will apply the description X1, X2, ..., Xn.
e. Result output: Results are displayed in the program's own editor. However, the output file can be exported to Word, allowing the use of all features of this powerful editor. In this case, the results are presented in a file with font Courier New 8, with customized heading and page numbering. Results can also be exported to Excel or Wordpad, and diagrams and figures to Excel or Mspaint.
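The data layout and the Read Data check described in these steps can be illustrated with a short Python sketch. It is not part of Genes: the function names `read_data` and `summarize`, the whitespace-delimited layout and the sample values are assumptions made for illustration only. The sketch treats the first columns as classificatory variables, applies the default names X1, X2, ..., Xn, and reports the average, maximum and minimum of each variable so that typos or structural gaps stand out.

```python
def read_data(text, n_class_cols, names=None):
    """Parse whitespace-delimited rows; the first n_class_cols columns are
    classificatory (e.g. treatment, block), the rest are variables."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        # A ragged file signals a gap in the data structure.
        raise ValueError(f"rows differ in number of columns: {sorted(widths)}")
    n_vars = widths.pop() - n_class_cols
    if names is None:
        # Default names when the user does not name the variables.
        names = [f"X{i + 1}" for i in range(n_vars)]
    # float() expects a point as the decimal symbol.
    return {name: [float(row[n_class_cols + j]) for row in rows]
            for j, name in enumerate(names)}


def summarize(columns):
    """Average, maximum and minimum per variable."""
    return {name: {"mean": sum(v) / len(v), "max": max(v), "min": min(v)}
            for name, v in columns.items()}


# Hypothetical file content: treatment, block, then two variables.
sample = """\
1 1 10.5 3.2
1 2 11.0 3.0
2 1 12.5 4.1
2 2 13.0 4.3
"""

summary = summarize(read_data(sample, n_class_cols=2))
for name, stats in summary.items():
    print(name, stats)
```

An implausible maximum or minimum in this summary (for example, a value ten times larger than the rest) would point to a typing error in the file before any analysis is run.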

Modules
The Genes software system contains analysis modules that involve several procedures of biometric analysis, as described below.

Biometrics
Biometrics is the application of statistics to the biological field and is essential for the planning, assessment and interpretation of all data obtained in biological research. User demand is growing in various research institutions in the biological area, especially for genetic studies, which deal with large data volumes. This requires adequate processing to ensure accurate estimation and interpretation of the statistical and biological parameters. However, there is a market gap for software that would meet this demand. In this context, the program GENES was developed to cover mainly the area of biometrics, with numerous procedures for adequate data processing. The following procedures are available in this module:
a. Genotype x Environment Interaction: stratification analysis, dissimilarity and correlations between environments.
e. Analysis of segregating and non-segregating generations: joint scaling test (P1, P2, F1, F2 with optional inclusion of BC1 and BC2), analysis of experiments with segregating lines and parents in alternating rows, and analysis of plants in generation Ft and the derived Ft+1 lines.
f. Repeatability: analysis of original or classified data.
g. Combined selection: analysis of experiments of families with balanced and unbalanced data. Analysis of the genetic design proposed by Comstock and Robinson (1948).
h. Genetic and Environmental Progress.
i. Nuclear Collection.
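The text does not describe how the repeatability procedure is computed. One common estimator is the intraclass correlation obtained from a one-way analysis of variance, and the Python sketch below illustrates it under that assumption, for balanced data only. The function name `repeatability` and the sample measurements are hypothetical and do not come from Genes.

```python
def repeatability(data):
    """Intraclass-correlation estimate of repeatability.

    data[i] holds the k repeated measurements of individual i
    (balanced data: every individual has the same k).
    """
    g = len(data)          # number of individuals
    k = len(data[0])       # measurements per individual
    if any(len(row) != k for row in data):
        raise ValueError("balanced data required")
    grand = sum(map(sum, data)) / (g * k)
    means = [sum(row) / k for row in data]
    # Mean square among individuals and within individuals (residual).
    ms_among = k * sum((m - grand) ** 2 for m in means) / (g - 1)
    ms_within = sum((y - m) ** 2
                    for row, m in zip(data, means)
                    for y in row) / (g * (k - 1))
    # r = sigma2_among / (sigma2_among + sigma2_within), written in mean squares.
    return (ms_among - ms_within) / (ms_among + (k - 1) * ms_within)


# Hypothetical example: three individuals, two measurements each.
measurements = [[10, 12], [14, 16], [20, 22]]
r = repeatability(measurements)
print(round(r, 3))
```

A value close to 1 indicates that repeated measurements of the same individual agree closely relative to the differences among individuals.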

Multivariate Analysis
The designation multivariate analysis covers a large number of methods and techniques that simultaneously use all variables in the analysis, interpretation and processing of the data set from a biological phenomenon under study. The mathematical complexity typical of multivariate methods has inhibited the transfer of the underlying stochastic fundamentals and principles to researchers. However, the key part, which is the statistical inference, has been stimulated through the use of well-constructed software with a user-friendly interface for researchers. In the program Genes, the scientist will find the following:
a) Analysis of structural simplification: Principal Components and Canonical Variable Analysis.
b) Association Analysis: Path analysis, Canonical Correlations and Factor analysis.
c) Analysis of diversity: Discriminant Analysis (by the method proposed by Anderson or based on principal components). Measures of dissimilarity: based on continuous, multicategoric or binary phenotypic variables. Analysis of molecular data from dominant or codominant markers. Cluster analysis: Tocher optimization method, hierarchical methods, graphic dispersion and 2D and 3D projection. Identification of more and less similar accessions. Importance of traits: by principal components, by Mahalanobis' generalized distance, or by canonical variable analysis.
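Two of the items in this list, measures of dissimilarity and the identification of more and less similar accessions, can be illustrated with a short Python sketch. The Euclidean distance (for continuous variables) and the complement of the Jaccard index (for binary variables) are used here as common textbook choices; the text does not state which measures Genes implements, and all function names and sample values below are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two accessions (continuous variables)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_dissimilarity(a, b):
    """1 - Jaccard index for binary (0/1) variables; joint absences are ignored."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return 0.0 if either == 0 else 1.0 - both / either


def dissimilarity_matrix(accessions, measure):
    """Full symmetric matrix of pairwise dissimilarities."""
    n = len(accessions)
    return [[measure(accessions[i], accessions[j]) for j in range(n)]
            for i in range(n)]


def most_and_least_similar(matrix):
    """Return (distance, i, j) for the closest and the most distant pair."""
    n = len(matrix)
    pairs = [(matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs), max(pairs)


# Hypothetical accessions described by two continuous traits.
continuous = [[1.0, 2.0], [4.0, 6.0], [1.0, 3.0]]
matrix = dissimilarity_matrix(continuous, euclidean)
closest, farthest = most_and_least_similar(matrix)
print("closest pair:", closest, "most distant pair:", farthest)

# Hypothetical binary descriptors for two accessions.
d_binary = jaccard_dissimilarity([1, 1, 0, 0], [1, 0, 1, 0])
print("binary dissimilarity:", d_binary)
```

The same matrix could then feed a clustering step; that step is left out here to keep the sketch short.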

Simulation
One of the major contributions of computing is that phenomena can be studied by simulating a complex situation in which parameters and constraints are established, so that the effect of certain controllable factors can be conveniently studied. Simulation is defined as a way of imitating the behavior of a real system by computational resources to study its functioning under alternative conditions, using logic models that describe the natural system as closely as possible.
Simulations are highly useful in genetic studies in various contexts, including studies of populations, of the individual or of the genome itself. They require researchers to develop appropriate biological models that represent the phenomena of interest as faithfully as possible, and programmers to develop suitable processing procedures according to the parameters and constraints, so that the influence of certain factors can be assessed.
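As a small illustration of this kind of simulation, the Python sketch below follows the frequency of one allele in a finite population under genetic drift. It assumes the Wright-Fisher model, a standard textbook model that is not necessarily the one implemented in Genes; the function name `simulate_drift` and all parameter values are hypothetical.

```python
import random


def simulate_drift(p0, pop_size, generations, seed=0):
    """Allele-frequency trajectory under Wright-Fisher genetic drift.

    p0          initial frequency of the allele
    pop_size    number of diploid individuals (2 * pop_size gene copies)
    generations number of generations to simulate
    seed        seed of the private random generator, for reproducibility
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    p = p0
    trajectory = [p]
    for _generation in range(generations):
        # Each gene copy of the next generation is drawn at random
        # from the current gene pool (binomial sampling).
        count = sum(1 for _copy in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory


# The controllable factor here is the population size.
small = simulate_drift(p0=0.5, pop_size=10, generations=50, seed=1)
large = simulate_drift(p0=0.5, pop_size=1000, generations=50, seed=1)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Running the function with several seeds and population sizes shows the point made in the text: parameters and constraints are fixed in advance, and the effect of one factor is then studied by varying it alone.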

Genetic Diversity
Studies on diversity can be directed to plant breeding, evolutionary associations, conservation and management of plant material, among other purposes. In each case, an adequate methodology and appropriate information are required. The data of measured units, plants, accessions or taxa can be phenotypic or genotypic. Phenotypic information is derived from the evaluation of characteristics with continuous or discrete distributions, of which the latter can be multicategoric or binary. Genotypic data are obtained from molecular markers or DNA sequencing. Markers can be dominant or codominant, and diallelic or multiallelic. All these situations are addressed in the application Genes through the following procedures:
a. Diversity between accessions: based on continuous, multicategoric or binary phenotypic variables, and analysis of data of dominant and codominant (multiallelic) markers.
e. Discriminant Analysis: discriminant analysis of Anderson, and analysis based on principal components or on k-nearest neighbors. Discriminant analyses from dissimilarity matrices.
f. Grouping analysis: Tocher optimization and hierarchical methods, graphic dispersion, 2D and 3D projection, and analysis of more and less similar accessions.
Matrices of dissimilarities: calculation of the correlation and sum between elements of dissimilarity matrices.
Importance of traits: considering phenotypic quantitative characters or molecular information, by means of MANOVA.
g. Optimization: analysis of the optimal number of binary or multiallelic markers for the study of genetic variance.
Simulation: simulation of populations, crossings and population samples, under the effect of divergent selection or genetic drift.
h. Relationship coefficient and Hardy-Weinberg Equilibrium: population analysis based on the information of codominant diallelic or multiallelic markers. Analysis of gametic disequilibrium.
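The Hardy-Weinberg analysis listed in this module can be illustrated, for the simplest case of one codominant diallelic marker, with the usual chi-square goodness-of-fit test. The Python sketch below is a textbook version written for illustration; the function name `hardy_weinberg_chi_square` and the genotype counts are hypothetical, and the text does not specify the test statistic that Genes uses.

```python
def hardy_weinberg_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one diallelic
    codominant marker, from observed counts of the three genotypes."""
    n = n_aa + n_ab + n_bb
    # With codominance, allele frequencies are counted directly.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, expected, chi2


# Hypothetical counts: the first sample matches the expected proportions,
# the second shows a deficit of heterozygotes.
p_eq, expected_eq, chi2_eq = hardy_weinberg_chi_square(25, 50, 25)
p_def, expected_def, chi2_def = hardy_weinberg_chi_square(40, 20, 40)
print("in equilibrium:", chi2_eq)
print("heterozygote deficit:", chi2_def)
```

With two alleles the statistic has one degree of freedom, so values above 3.84 indicate a departure from equilibrium at the 5% level.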

Experimental Statistics
This module contains procedures based on statistical models with wide application in various areas of research and in undergraduate and graduate teaching. The importance of statistical analysis lies in providing probabilistic support for a particular hypothesis formulated on the basis of extensive studies and analyses of research results. In statistics, parameter estimates related to the data are presented and interpreted per se, or hypotheses are tested and the results are associated with probability values by means of statistical tests. Usually, the choice of a particular inferential statistic is directed by the study question. The software Genes offers a range of procedures for statistical analyses.
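One analysis of the kind this module covers is the analysis of variance for a randomized complete block design, for which the user supplies the number of treatments and blocks. The Python sketch below computes the sums of squares and the F statistic for treatments from first principles. It is an illustration under stated assumptions (one variable, balanced data, one observation per treatment and block); the function name `rcbd_anova` and the sample values are hypothetical and do not reproduce the Genes output.

```python
def rcbd_anova(data):
    """ANOVA for a randomized complete block design.

    data[i][j] is the observation of treatment i in block j.
    """
    t = len(data)        # number of treatments
    b = len(data[0])     # number of blocks
    grand = sum(map(sum, data)) / (t * b)
    trt_means = [sum(row) / b for row in data]
    blk_means = [sum(data[i][j] for i in range(t)) / t for j in range(b)]

    ss_total = sum((y - grand) ** 2 for row in data for y in row)
    ss_trt = b * sum((m - grand) ** 2 for m in trt_means)
    ss_blk = t * sum((m - grand) ** 2 for m in blk_means)
    ss_err = ss_total - ss_trt - ss_blk

    df_trt, df_blk = t - 1, b - 1
    df_err = df_trt * df_blk
    ms_trt = ss_trt / df_trt
    ms_err = ss_err / df_err
    return {
        "ss_treatments": ss_trt, "df_treatments": df_trt,
        "ss_blocks": ss_blk, "df_blocks": df_blk,
        "ss_error": ss_err, "df_error": df_err,
        "ms_treatments": ms_trt, "ms_error": ms_err,
        "f_treatments": ms_trt / ms_err,
    }


# Hypothetical experiment: 3 treatments (rows) in 3 blocks (columns).
example = [[10, 12, 11],
           [14, 15, 16],
           [20, 19, 21]]
result = rcbd_anova(example)
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

The F statistic would then be compared with the F distribution with `df_treatments` and `df_error` degrees of freedom to obtain the probability value mentioned in the text.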

Matrices
The study of matrices is considered fundamental because matrices are an important tool in the area of mathematics related to calculations and parameter estimation. They are widely applied in estimation methods and model fitting, such as least squares and maximum likelihood, and in different matrix analyses. The following procedures are available in Genes:
a. Diagnosis of multicollinearity
b. Algebra of matrices
c. Solution of the system

Integration with other software
Currently, the software GENES has 205 executable projects involving the modules of experimental statistics, biometrics, multivariate analysis, genetic diversity, simulation and matrices. Each procedure has a particular data set, for which an appropriate biometric template is prepared that allows the user to process data and to generate and properly interpret results of the studied phenomenon. However, additional analyses may be required, or a different way of carrying out the same kind of study may need to be evaluated. In this case, the user is more likely to try a new analysis option if little effort is required to access an alternative or supplementary program. The user of GENES has direct access to other applications such as:
Microsoft Word: receives output results and is used to produce reports.
Microsoft Excel: receives outputs or results for complementary analysis, in particular graphical analysis.
Microsoft Paint: receives figures, images and diagrams resulting from the analysis, to which graphical resources can be applied, as the researcher sees fit, to improve the appearance of the result.
Free software environment R: For each procedure available within GENES, the user finds a set of instructions with the appropriate settings so that the data can be accessed and processed by the program R, according to the researcher's demand. R has been increasingly accepted by universities and companies around the world. Nowadays, the acquisition costs of statistical software packages that are similar or even poorer in terms of analysis capacity are very high, especially for the predominantly small and medium businesses in our country. Thus, the inclusion of this facility in Genes is yet another contribution to the use of R, intended to break barriers and facilitate the construction of diagrams and high-quality data analyses, at no cost and with the same reliability as other software.
Matlab ®: GENES generates scripts, each consisting of a set of statements or commands that solve problems of a particular type of study, based on a set of data or information, within the Matlab program. Matlab is an interactive system whose basic data element is an array that does not require dimensioning. This system allows many numerical problems to be solved in a fraction of the time one would spend writing a similar program in Fortran, Basic or C. Moreover, solutions to problems are expressed almost exactly as they are written mathematically. Each script consists of a set of methods, organized and documented as one or more parts of a process, which allows errors to be identified and corrected, if necessary, by debugging the script until an error-free solution is obtained.

Conclusion
GENES is a very important software package for data analysis and processing with different biometric models and is essential in genetic studies applied to plant and animal breeding. The software should meet the increasing demand of users in numerous research institutions who deal with an enormous volume of data and require adequate ways of processing to accurately estimate statistical and biological parameters.