Authors:
(1) Sergey Kucheryavskiy, Department of Chemistry and Bioscience, Aalborg University (corresponding author, [email protected]);
(2) Sergei Zhilin, CSort, LLC., Germana Titova st. 7, Barnaul, 656023, Russia (contributing author, [email protected]).
Editor’s note: This is Part 4 of 4 of a study detailing a new method for the augmentation of numeric and mixed datasets. Read the rest below.
Table of Links
- Abstract and 1 Introduction
- 2 Methods
- 2.1 Generation of PV-sets based on Singular Value Decomposition
- 2.2 Generation of PV-sets based on PLS decomposition
- 3 Results
- 3.1 Datasets
- 3.2 ANN regression of Tecator data
- 3.3 ANN classification of Heart data
- 4 Discussion
- 5 Conclusions and References
4 Discussion
The experimental results confirm the benefits of PV-set augmentation; however, the ANN learning parameters must be optimized for the benefits to become significant. At the same time, optimization of the PV-set generation algorithm is not necessary in most cases. Based on our experiments, we advise using cross-validation resampling with 5 or 10 splits and a number of latent variables large enough to capture the majority of the variation in X. In some specific cases, one can use the tools for quality control of generated PV-sets described in [10].
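The advice on resampling and on the number of latent variables can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `n_components_for_variance` and `random_segments`, the 95% variance threshold, and the synthetic data are all assumptions made for illustration. The sketch picks the smallest number of components whose cumulative explained variance reaches the threshold and splits the rows into 10 random cross-validation segments.

```python
import numpy as np


def n_components_for_variance(X, threshold=0.95):
    """Smallest number of components whose cumulative explained
    variance of the mean-centered X reaches `threshold` (hypothetical helper)."""
    Xc = X - X.mean(axis=0)                     # mean-center the columns
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    explained = np.cumsum(s**2) / np.sum(s**2)  # cumulative variance fractions
    k = int(np.searchsorted(explained, threshold)) + 1
    return min(k, len(s))                       # guard against rounding at threshold=1.0


def random_segments(n_rows, n_splits=10, seed=0):
    """Random partition of row indices into `n_splits` segments (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_rows), n_splits)


# Synthetic collinear data (assumption): 5 latent factors plus small noise.
rng = np.random.default_rng(1)
scores = rng.standard_normal((200, 5))
loadings = rng.standard_normal((5, 50))
X = scores @ loadings + 0.01 * rng.standard_normal((200, 50))

k = n_components_for_variance(X, threshold=0.95)
segments = random_segments(X.shape[0], n_splits=10)
print("latent variables:", k, "| segments:", len(segments))
```

The threshold is a judgment call: a higher value keeps more latent variables and preserves more of the variation in X in the generated sets.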
It must also be noted that the use of PV-sets for data augmentation is not always beneficial. According to our experiments, which are not reported here, artificially enlarging the training set has no significant effect on model performance for methods that are robust to overfitting, such as random forest (RF). In the case of eXtreme Gradient Boosting, changing the training parameters that regulate overfitting, such as the learning rate, maximum depth, and minimum sum of instance weight, can have an effect, but most of the time the effect observed in our experiments was marginal.
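For readers who want to repeat this kind of check, the three parameters named above correspond to `learning_rate`, `max_depth`, and `min_child_weight` in the XGBoost Python package. The sketch below only enumerates a small search grid; the grid values are illustrative assumptions and are not the settings used in the experiments.

```python
from itertools import product

# Overfitting-related XGBoost parameters; the values are assumed for illustration.
param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [2, 4, 6],
    "min_child_weight": [1, 5, 10],   # minimum sum of instance weight in a child
}

# Every combination as a dict that could be passed to a model constructor.
param_sets = [
    dict(zip(param_grid.keys(), values))
    for values in product(*param_grid.values())
]
print(len(param_sets), "parameter combinations")
```

Each combination would be evaluated twice, once with the original training set and once with the augmented one, to see whether augmentation changes the result.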
5 Conclusions
This paper proposes a new method for data augmentation. The method is beneficial specifically for datasets with a moderate to high degree of collinearity, as it directly utilizes this feature in the generation algorithm.
Two proposed implementations of the method (SVD-based and PLS-based) cover most of the common data analysis tasks, such as regression, discrimination, and one-class classification (authentication). Both implementations are very fast: generating a PV-set for a 200×500 matrix X with 20 latent variables and 10 cross-validation segments requires several seconds (less than a second on a powerful PC), which is much less than training an ANN model with several layers.
The method can work with small datasets (from tens of observations) and can be used for both numeric and mixed datasets, where one or several variables are categorical.
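To get a rough feel for this cost on a given machine, the sketch below times one truncated SVD per cross-validation segment for a random 200×500 matrix with 20 latent variables and 10 segments. It is only a proxy under stated assumptions: it covers the local decompositions and omits the remaining steps of PV-set generation, the data are random, and all variable names are hypothetical.

```python
import time

import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols, n_lv, n_segments = 200, 500, 20, 10

X = rng.standard_normal((n_rows, n_cols))        # random stand-in for real data
segments = np.array_split(rng.permutation(n_rows), n_segments)

start = time.perf_counter()
local_loadings = []
for held_out in segments:
    # Rows kept for the local model (everything except the current segment).
    keep = np.setdiff1d(np.arange(n_rows), held_out)
    X_local = X[keep] - X[keep].mean(axis=0)     # center with local means
    _, _, Vt = np.linalg.svd(X_local, full_matrices=False)
    local_loadings.append(Vt[:n_lv].T)           # first n_lv loading vectors
elapsed = time.perf_counter() - start

print(f"{n_segments} local decompositions took {elapsed:.3f} s")
```

The measured time depends on the hardware and the linear algebra backend, so it should be read as an order-of-magnitude indication only.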
References
[1] Ratner, A. J., Ehrenberg, H. R., Hussain, Z., Dunnmon, J. & Ré, C. Learning to compose domain-specific transformations for data augmentation (2017). 1709.01643.
[2] Goodfellow, I. J. et al. Generative adversarial networks (2014). 1406.2661.
[3] Dao, T. et al. A kernel theory of modern data augmentation (2019). 1803.06084.
[4] Perez, E. & Ventura, S. Progressive growing of generative adversarial networks for improving data augmentation and skin cancer diagnosis. Artificial Intelligence in Medicine 141, 102556 (2023). URL https://www.sciencedirect.com/science/article/pii/S0933365723000702.
[5] Perez, F., Vasconcelos, C., Avila, S. & Valle, E. Data augmentation for skin lesion analysis. In Stoyanov, D. et al. (eds) OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, 303–311 (Springer International Publishing, Cham, 2018).
[6] Iglesias, G., Talavera, E., Gonzalez-Prieto, A., Mozo, A. & Gomez-Canaval, S. Data augmentation techniques in time series domain: a survey and taxonomy. Neural Computing and Applications 35, 10123–10145 (2023). URL https://doi.org/10.1007/s00521-023-08459-3.
[7] Saiz-Abajo, M., Mevik, B.-H., Segtnan, V. & Næs, T. Ensemble methods and data augmentation by noise addition applied to the analysis of spectroscopic data. Analytica Chimica Acta 533, 147–159 (2005). URL https://www.sciencedirect.com/science/article/pii/S000326700401428X.
[8] Chadebec, C. & Allassonniere, S. Data augmentation with variational autoencoders and manifold sampling (2021). 2103.13751.
[9] Kucheryavskiy, S., Zhilin, S., Rodionova, O. & Pomerantsev, A. Procrustes cross-validation—a bridge between cross-validation and independent validation sets. Analytical Chemistry 92, 11842–11850 (2020).
[10] Kucheryavskiy, S., Rodionova, O. & Pomerantsev, A. Procrustes cross-validation of multivariate regression models. Analytica Chimica Acta 1255, 341096 (2023). URL https://www.sciencedirect.com/science/article/pii/S0003267023003173.
[11] de Jong, S. Simpls: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18, 251–263 (1993).
[12] Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library, 8024–8035 (Curran Associates, Inc., 2019). URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[13] Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020). URL https://doi.org/10.1038/s41586-020-2649-2.
[14] Borggaard, C. & Thodberg, H. H. Optimal minimal neural interpretation of spectra. Analytical Chemistry 64, 545–551 (1992).
[15] Detrano, R. et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology 64, 304–310 (1989). URL https://www.sciencedirect.com/science/article/pii/0002914989905249.
[16] Detrano, R. et al. Bayesian probability analysis: a prospective demonstration of its clinical utility in diagnosing coronary disease. Circulation 69, 541–547 (1984).
[17] Janosi, A., Steinbrunn, W., Pfisterer, M. & Detrano, R. Heart Disease Dataset (1988).