Authors:
(1) Sergey Kucheryavskiy, Department of Chemistry and Bioscience, Aalborg University, Corresponding author ([email protected]);
(2) Sergei Zhilin, CSort, LLC., Germana Titova st. 7, Barnaul, 656023, Russia, Contributing author ([email protected]).
Editor’s note: This is Part 3 of 4 of a study detailing a new method for the augmentation of numeric and mixed datasets. Read the rest below.
Table of Links
- Abstract and 1 Introduction
- 2 Methods
- 2.1 Generation of PV-sets based on Singular Value Decomposition
- 2.2 Generation of PV-sets based on PLS decomposition
- 3 Results
- 3.1 Datasets
- 3.2 ANN regression of Tecator data
- 3.3 ANN classification of Heart data
- 4 Discussion
- 5 Conclusions and References
3.1 Datasets
Two datasets were selected to demonstrate the capabilities of PV-sets for data augmentation. Since both SVD and PLS decompositions model the variance-covariance structure of X, the proposed augmentation approach works best for datasets with a moderate to high degree of collinearity, which was taken into account when selecting the datasets.
It must be noted that optimization of the ANN architecture was outside the scope of this work. In both examples, a very simple architecture with several layers was used.
All calculations were carried out using Python 3.10.12 supplemented with packages PyTorch[12] 2.0.1, NumPy[13] 1.26.0 and prcv 1.1.0. Statistical analysis of the results was performed in R 4.3.1.
The Python scripts used to obtain the results presented in this chapter are freely available, so everyone can reproduce the reported results.
3.1.1 Tecator
The Tecator dataset consists of spectroscopic measurements of finely minced meat samples with different moisture, protein and fat content. The measurements were taken by Tecator Infratec Food and Feed Analyzer working by the Near Infrared Transmission (NIT) principle. The dataset was downloaded from the StatLib public dataset archive (http://lib.stat.cmu.edu/datasets/).
The dataset contains the spectra and the response values for 215 meat samples in total (subsets labeled as C, M and T in the archive), of which 170 samples are used as the training set and 45 are used as the independent test set to assess the performance of the final optimized models. Fat content (as % w/w) was selected as the response variable for this research.
The matrix of predictors X contains absorbance values measured at 100 wavelengths in the range of 850–1050 nm.
According to the Beer-Lambert law, there is a linear relationship between the concentration of chemical components and the absorbance of the electromagnetic radiation. Therefore, the prediction of fat content can in theory be performed by fitting the dataset with a multiple linear regression model. However, the shape of the real spectra suffers from various side effects, including noise, light scattering from an uneven surface of the samples, and many others.
Figure 1 shows the spectra from the training set in the form of line plots colored according to the corresponding fat content, using a color gradient from blue (low content) to red (high content). As one can see, there is no clear association between the shape of the spectra and the response value.
Handling such data necessitates the careful selection and optimization of preprocessing methods to eliminate undesirable effects and reveal the information of interest. However, the use of nonlinear models, such as artificial neural networks, has the capability to automatically unveil the needed information [14], bypassing the need for extensive preprocessing steps.
Nonlinearity is not the sole factor in play. Preprocessing can be considered as a feature extraction procedure. It is known that in networks with many hidden layers, the initial layers act as feature extractors. The greater the number of layers, the more intricate features can be derived from raw data. However, it is worth noting that such models demand a substantial number of measurements or observations to achieve satisfactory performance.
The Tecator dataset will be employed for a thorough investigation of how PV-set based data augmentation can address this problem, and how the two main parameters of the PV-set generation procedure (number of latent variables, number of segments), as well as the number of generated PV-sets, influence the performance of the ANN regression models.
3.1.2 Heart disease
The Heart dataset came from a study of 303 patients referred for coronary angiography at the Cleveland Clinic. The patients were examined by a set of tests, including physical examination, electrocardiogram at rest, serum cholesterol determination and fasting blood sugar determination. The dataset consists of the examination results combined with historical records. More details about the data can be found elsewhere [15] [16]. The dataset is publicly available from the UC Irvine Machine Learning Repository [17]. The original data include records from several hospitals; in this research, only data from the Cleveland Clinic are used.
Eleven records with missing values were removed from the original data, resulting in 292 rows. The dataset consists of 14 numeric and categorical variables; an overview is given in Table 1. The Class variable is used as the response, and the remaining variables are used as predictors.
This dataset is used to demonstrate that PV-set augmentation can also be applied to mixed datasets with categorical variables, and to show how SVD and PLS versions work for solving binary classification tasks.
3.2 ANN regression of Tecator data
As with most spectroscopic measurements, the Tecator data are highly collinear: SVD decomposition of the matrix with spectra resulted in a first eigenvalue of 25.6, while from the 6th onward all eigenvalues are well below 0.001. PLS decomposition of the training set gave similar outcomes; therefore, one can assume that 5–6 latent variables are enough to capture the majority of the predictors' variation.
At the same time, a preliminary investigation showed that the ck/c values are within the desired limits of [0, 2] for all latent variables up to A = 50. Thus, using any number of latent variables between 5 and 50 will produce a reasonable PV-set.
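The choice of the number of latent variables can be illustrated by inspecting the eigenvalues obtained from SVD of the mean-centered training data. A minimal sketch, using synthetic collinear data in place of the actual Tecator spectra:

```python
import numpy as np

# Synthetic stand-in for the Tecator spectra: 170 samples x 100 wavelengths
# with strong collinearity (five latent directions plus small noise).
rng = np.random.default_rng(0)
scores = rng.normal(size=(170, 5))
loadings = rng.normal(size=(5, 100))
X = scores @ loadings + 0.001 * rng.normal(size=(170, 100))

# Eigenvalues of the covariance matrix are the squared singular values
# of the mean-centered data divided by (n - 1).
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
eigenvalues = s**2 / (X.shape[0] - 1)

# A sharp drop in the eigenvalues suggests how many latent variables
# capture most of the predictors' variation.
print(eigenvalues[:8])
```

For data like these, the eigenvalues drop by several orders of magnitude after the fifth component, mirroring the behavior reported for the real spectra.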
To predict fat content, a simple ANN model was employed. The model consisted of six fully connected (linear) layers of the following sizes: (100, 150), (150, 200), (200, 150), (150, 100), (100, 50), and (50, 1). The first value in each pair of parentheses denotes the number of inputs and the second the number of outputs of the layer. The first five layers were supplemented with the ReLU activation function, while the last layer was used as the output of the model. All other characteristics of the model are shown in Table 2.
The model contains approximately 95000 tunable parameters in total, which is much larger than the number of samples in the training set (170). However, as already noted, the first several hidden layers of an ANN usually serve as feature extractors, so not all parameters are part of the regression model itself.
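Assuming a standard PyTorch setup, the described architecture can be sketched as follows (the layer sizes are taken from the text; the training settings from Table 2 are omitted):

```python
import torch
import torch.nn as nn

# Sketch of the six-layer fully connected regression network:
# sizes (100,150), (150,200), (200,150), (150,100), (100,50), (50,1),
# with ReLU after the first five layers; the last layer is the output.
model = nn.Sequential(
    nn.Linear(100, 150), nn.ReLU(),
    nn.Linear(150, 200), nn.ReLU(),
    nn.Linear(200, 150), nn.ReLU(),
    nn.Linear(150, 100), nn.ReLU(),
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, 1),
)

# Count the tunable parameters (weights and biases).
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```

Counting weights and biases for these layer sizes gives roughly 95 thousand parameters, matching the figure quoted above.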
The columns of the predictors were standardized using mean and standard deviation values computed for the original training set. The response values were mean centered only to obtain errors directly comparable in the original units (%w/w).
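A minimal sketch of this preprocessing step, with random stand-in data in place of the actual spectra and fat-content values:

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(170, 100))  # stand-in spectra
X_test = rng.normal(loc=5.0, scale=2.0, size=(45, 100))
y_train = rng.uniform(1, 50, size=170)  # stand-in fat content, %w/w

# Standardize predictors using statistics of the training set only,
# and apply the same statistics to the test set.
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sd
X_test_s = (X_test - mu) / sd

# Mean-center the response only, so prediction errors stay in %w/w.
y_mean = y_train.mean()
y_train_c = y_train - y_mean
```

Applying the training-set statistics to the test set avoids information leaking from the test data into the model.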
First, the ANN training procedure was run 30 times using the original training set without augmentation. Repeated experiments are necessary because the ANN training procedure is not deterministic, as explained in the introduction, and the performance varies from model to model.
After that, a full factorial experiment was set up to determine how the augmentation of the training data with PV-sets influences the performance of the ANN model and which parameters for PV-set generation as well as the number of generated sets have the largest effect on the performance. The PV-sets were generated using PLS decomposition. The generation parameters are shown in Table 3.
All possible combinations of the values from the table were tested (60 combinations in total). For every combination the training/test procedure was repeated 5 times with full reinitialization, resulting in 300 outcomes.
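The factorial design can be sketched as follows; the parameter levels below are hypothetical stand-ins for those in Table 3, chosen only so that the design has 60 combinations:

```python
from itertools import product

# Hypothetical parameter levels (the actual levels are listed in Table 3);
# chosen here so that the full factorial design has 60 combinations.
n_pv_sets = [1, 5, 10, 20, 50]
n_latent_vars = [5, 10, 20, 50]
n_segments = [2, 4, 10]
n_replications = 5

# Enumerate every combination, replicated with full reinitialization.
runs = [
    (npv, nlv, nseg, rep)
    for npv, nlv, nseg in product(n_pv_sets, n_latent_vars, n_segments)
    for rep in range(n_replications)
]
print(len(runs))
```

With 5 × 4 × 3 = 60 combinations and 5 replications each, the design yields 300 outcomes, as in the experiment described above.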
Figure 2 graphically illustrates the outcomes of the experiment in the form of boxplots. Every plot shows the variation in the RMSEP for a given value of one of the three tested parameters. The black point inside each box supplemented with text shows the corresponding median value.
It is clear that the number of PV-sets used for data augmentation (first plot in the figure) has the largest effect on the RMSEP. Thus, augmenting the original training set with one PV-set reduces the median test set error by approximately 18% (from 4.17 to 3.41 %w/w). Adding five PV-sets reduces the median RMSEP down to 1.58 %w/w (62% reduction). Using 10 PV-sets further reduces the RMSEP down to 1.36 %w/w; however, the effect is smaller.
Changing the number of latent variables does not give a significant difference in the performance of the models, while the number of segments in cross-validation resampling shows a small effect. The best model was obtained using data augmented with 20 PV-sets generated with 10 latent variables and 10 segments. It could predict fat content in the test set with RMSEP = 0.94 %w/w (R2 = 0.995), more than three times smaller than that of the best model obtained without augmentation.
Statistical analysis of the outcomes, carried out by N-way analysis of variance (ANOVA) and Tukey's honest significance difference (HSD) test, confirmed the significant difference between average RMSE values obtained using different numbers of PV-sets (ANOVA p-value ≪ 0.01). However, pairwise comparison shows no significant difference between 10 vs. 20, 10 vs. 50, and 20 vs. 50 PV-sets (p-value > 0.20 for all three pairs).
The number of latent variables has no significant effect (ANOVA p-value ≈ 0.71). The number of segments shows a significant difference only between 2 segments and the other choices, favoring the use of 4 segments or more (p-values for all pairs with 2 segments ≪ 0.01; for 4 segments vs. 10 segments, p-value ≈ 0.26).
Optimization of the ANN learning parameters (by varying the learning rate and the batch size) toward reducing RMSEP values for the models trained without augmentation made this gap smaller. Figure 3 shows the results obtained using a learning rate of 0.001. Training the model without augmentation resulted in a median RMSEP = 2.84 %w/w, while training on the data augmented with 20 PV-sets resulted in a median RMSEP = 1.80 %w/w. The latter is still approximately 37% smaller; however, the overall performance of the models trained with this learning rate is worse.
This effect is understandable as changing the learning rate and batch size to improve the performance of the model trained on the original data makes the model less sensitive to overfitting and local minima problems but also less flexible, which, in turn, makes the use of augmented data less efficient.
Figure 4 further illustrates this effect. The plot shows the variation in RMSEP values for ANN models trained using different learning rates. Each learning rate is represented by a pair of box and whiskers series. The left (blue) series in each pair illustrates the results obtained using 30 models trained with the original data. The right (red) series in each pair shows the results for 30 models trained using the augmented data (20 PV-sets computed using 10 LVs and 4 segments). New PV-sets were computed at each run to eliminate possible random effects.
This experiment was also repeated using other ANN architectures, including networks with convolutional layers. The gap between the performances varied depending on the architecture and the learning parameters; however, in all experiments, a gap of at least 20% remained, clearly indicating the benefits of PV-set augmentation in this particular case.
3.3 ANN classification of Heart data
All categorical variables from the Heart dataset with L levels were converted to L − 1 dummy variables with values [0, 1], so the matrix with predictors, X, contained 17 columns in total. The columns of the matrix were mean centered and standardized. SVD decomposition of the matrix indicates moderate collinearity, with eigenvalues ranging from 3.43 to 0.77 for the first 10 latent variables.
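The L − 1 dummy coding can be sketched for a single hypothetical categorical variable as follows (the variable and its levels are illustrative, not taken from the Heart data):

```python
import numpy as np

# Hypothetical categorical column with L = 3 levels; L - 1 = 2 dummy
# columns are created, with the first level serving as the reference.
levels = ["low", "medium", "high"]
column = np.array(["low", "high", "medium", "low", "high"])

# Build one 0/1 indicator column for each non-reference level.
dummies = np.column_stack(
    [(column == lev).astype(int) for lev in levels[1:]]
)
print(dummies)
```

A record at the reference level is encoded as all zeros, which keeps the dummy columns linearly independent.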
The ANN model for classification of the patient conditions used in this research included six linear layers and one sigmoid layer. The first five layers were supplemented with the ReLU activation function, while the last layer was used as the output of the model. The main characteristics of the model are shown in Table 4. The model has approximately 14700 tunable parameters, while the original dataset has only 292 objects.
Since this dataset does not contain a dedicated test set, the following procedure was employed. At each run, the data objects were split into two groups with healthy persons in one group and sick persons in the other. Then, 75% of records were selected randomly from each group and merged together to form a training set. The remaining 25% of records formed the test set for the run. Classification accuracy, which was computed as a ratio of all correctly classified records to the total number of records in the test set, was used as the performance indicator.
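A minimal sketch of the stratified splitting and accuracy computation, using bare class labels as stand-in data and a placeholder in place of real model predictions:

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 160 + [1] * 132)  # stand-in: 292 patients, healthy/sick

# Stratified split: 75% of each class for training, the rest for testing.
train_idx, test_idx = [], []
for cls in (0, 1):
    idx = np.flatnonzero(y == cls)
    rng.shuffle(idx)
    cut = int(0.75 * len(idx))
    train_idx.extend(idx[:cut])
    test_idx.extend(idx[cut:])

# Accuracy: correctly classified test records / total test records.
y_test = y[test_idx]
y_pred = y_test.copy()  # placeholder for real model predictions
accuracy = np.mean(y_pred == y_test)
print(len(train_idx), len(test_idx), accuracy)
```

Splitting within each class keeps the proportion of healthy and sick patients roughly equal in the training and test sets across runs.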
The experimental design was similar to the Tecator experiments. First, the training/test procedure was repeated 30 times using the data without augmentation. Then the ANN models were trained using the augmented data and tested using the randomly selected test set. The PV-sets for augmentation were computed using an algorithm based on PLS decomposition applied to the randomly selected training set. As in the previous chapter, all possible combinations of PV-set generation parameters from Table 2 were tested with 5 replications (300 runs in total).
Figure 5 presents the outcomes of the experiment in the form of boxplots, which show a variation of the test set accuracy computed at each run. One can clearly notice that the accuracy of the models trained on the data without augmentation is very low (median accuracy is 0.50), while augmenting the training data with 20 PV-sets raises the median accuracy to 0.84. The best model obtained in these experiments is also trained on the augmented data and has an accuracy of 0.91.
N-way ANOVA and Tukey's tests for the outcomes have shown that two parameters, the number of PV-sets and the number of segments in cross-validation resampling, have a significant influence on the accuracy (p-value for both factors ≪ 0.001 in the ANOVA test). However, similar to the Tecator results, a significant difference was observed only for two segments; using four or more segments for the generation of PV-sets results in statistically similar performance of the models trained on the augmented data.
The results obtained with the SVD-based algorithm are very similar to those obtained using the PLS-based one. However, in this case, the overall performance of the models trained on the augmented data was slightly higher, with a median accuracy of 0.84 for the models trained on data augmented with 10, 20 and 50 PV-sets. The best model had an accuracy of 0.95.
Reducing the learning rate to 0.001 had the same effect as for the data augmented using the PLS-based algorithm: the gap between the models trained with and without augmented data is eliminated, but the overall performance also gets lower, with a median accuracy of approximately 0.79.