Which Feature Selection Method Performs Best with Changing Data Sizes? | HackerNoon

News Room · Published 13 May 2025

Authors:

(1) Mahdi Goldani;

(2) Soraya Asadi Tirvan.

Table of Links

  • Abstract and Introduction
  • Methodology
  • Dataset
  • Similarity methods
  • Feature selection methods
  • Measure the performance of methods
  • Result
  • Discussion
  • Conclusion and References

Discussion

Three main factors determine the best method in this research: the R-squared value, the sensitivity of a method to changes in sample size, and the degree of fluctuation as the sample size changes. Among the methods, the variance, stepwise, and correlation methods had the highest average R-squared values. The mutual information, variance, simulated, edit distance, and Hausdorff methods were the least sensitive to data size. The variance, simulated, and edit distance methods fluctuated the least. By these metrics, the variance method satisfied all three criteria. Among the similarity methods, Hausdorff and edit distance performed well compared to the others.
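As a concrete illustration of the variance criterion discussed above, a variance filter can be sketched as ranking candidate features by sample variance and keeping the top k. This is a minimal sketch, not the authors' exact implementation; the function name and toy data are illustrative:

```python
import numpy as np

def variance_select(X, k):
    """Rank the columns of X by sample variance and keep the top k.

    Returns the indices of the k highest-variance features,
    highest variance first.
    """
    variances = np.var(X, axis=0, ddof=1)      # per-feature sample variance
    return np.argsort(variances)[::-1][:k]     # descending order of variance

# Toy example: three features with increasing spread.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 0.1, 50),   # low variance
    rng.normal(0, 1.0, 50),   # medium variance
    rng.normal(0, 5.0, 50),   # high variance
])
print(variance_select(X, 2))  # keeps columns 2 and 1
```

A practical refinement is to standardize or scale features first when they are measured in different units, since raw variance otherwise favors large-scale variables.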

This study focuses on improving the performance of a data-driven model by selecting the most appropriate features. It is nonetheless important to acknowledge certain limitations. First, many existing hybrid methods remain to be investigated. Second, the results presented here may change as the dataset changes.

Conclusion

The aim of this study is to identify feature selection techniques that are insensitive to small data sizes and that select a subset of features with high predictive performance when data are scarce. As noted in the results section, on the dataset used in this study the performance of standard feature selection methods fluctuates across data volumes, which reduces confidence in these techniques. Among the ten standard feature selection methods, the variance and simulated methods are more stable than the others. The graphs show that the similarity methods introduced as an alternative to standard feature selection are less volatile, which increases confidence in their results when the data size varies. Therefore, under the first approach of this study, the similarity methods are more reliable than the usual feature selection methods. It is essential to note, however, that these results come from a single dataset and may change on other datasets.
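The stability comparison described here can be sketched as refitting a model on progressively larger subsets of the data and recording R-squared at each size; stable methods produce a flat curve. This is a simplified sketch on synthetic data using ordinary least squares, not the paper's exact experimental setup, and the function names are illustrative:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination for predictions y_hat against y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def sensitivity_curve(X, y, sizes):
    """Fit OLS on the first n rows for each n in sizes; record in-sample R²."""
    scores = []
    for n in sizes:
        Xn = np.column_stack([np.ones(n), X[:n]])       # add intercept
        beta, *_ = np.linalg.lstsq(Xn, y[:n], rcond=None)
        scores.append(r_squared(y[:n], Xn @ beta))
    return scores

# Synthetic data: y depends linearly on x with mild noise.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
print(sensitivity_curve(x.reshape(-1, 1), y, [50, 100, 200]))
```

The spread of the resulting scores (e.g. their standard deviation across sizes) is one simple way to quantify the "fluctuation" criterion used to compare methods.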

The second and more critical question the article sought to test is the sensitivity of the methods to changes in data size: whichever method is least sensitive to such changes should be chosen. According to the trendline results, the variance, correlation, and simulated methods, together with edit distance and Hausdorff, are the least sensitive to observation size. Given that the time series similarity methods showed the smallest fluctuation among the feature selection methods, they can serve as reliable feature selection tools. Similarity methods such as the Hausdorff and edit distance approaches emerged as the most stable of the techniques evaluated. This robustness across data sizes underscores their reliability and suitability for this research context: their consistent performance indicates that they can handle fluctuations in the number of observations without significant loss of predictive accuracy. Such resilience is crucial for the robustness and generalizability of predictive models, particularly in dynamic environments such as financial markets. Consequently, these methods stand out as promising tools for feature selection, offering researchers a dependable approach to identifying relevant variables for predictive modeling tasks, such as forecasting Apple’s closing price.
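The similarity-based selection idea can be sketched as ranking candidate features by their Hausdorff distance to the target series, keeping the closest ones. The sketch below treats each 1-D series as a set of values, which is a simplification (Hausdorff distance over time series is often computed on (time, value) point pairs); the feature names are hypothetical and not taken from the paper:

```python
import numpy as np

def hausdorff_1d(a, b):
    """Symmetric Hausdorff distance between two 1-D series viewed as value sets."""
    d = np.abs(a[:, None] - b[None, :])              # pairwise |a_i - b_j|
    return max(d.min(axis=1).max(),                  # directed distance a -> b
               d.min(axis=0).max())                  # directed distance b -> a

def rank_by_similarity(features, target):
    """Order candidate features by Hausdorff distance to the target, closest first."""
    dists = {name: hausdorff_1d(series, target) for name, series in features.items()}
    return sorted(dists, key=dists.get)

target = np.array([1.0, 2.0, 3.0, 4.0])
features = {
    "close_lag1": np.array([1.1, 2.1, 2.9, 4.2]),    # tracks the target closely
    "volume":     np.array([10.0, 9.0, 12.0, 11.0]), # far from the target
}
print(rank_by_similarity(features, target))  # → ['close_lag1', 'volume']
```

In practice, series should be normalized before computing distances so that scale differences between features do not dominate the ranking.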

References

  1. E. W. Newell and Y. Cheng, (2016) Mass cytometry: blessed with the curse of dimensionality: Nature Immunology, 17: 890-895. https://doi.org/10.1038/ni.3485

  2. B. Remeseiro and V. Bolon-Canedo, (2019) A review of feature selection methods in medical applications: Computers in biology and medicine. 112. https://doi.org/10.1016/j.compbiomed.2019.103375

  3. Zhu X, Wang Y, Li Y, Tan Y, Wang G, Song Q. (2019) A new unsupervised feature selection algorithm using similarity-based feature clustering: Comput Intell. 35(1): 2-22. doi:10.1111/COIN.12192

  4. Mitra, Pabitra, C. A. Murthy, and Sankar K. Pal. (2002) Unsupervised feature selection using feature similarity: IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 3: 301-312. DOI: 10.1109/34.990133

  5. Yu, Qiao, Shu-juan Jiang, Rong-cun Wang, and Hong-yang Wang. (2017) A feature selection approach based on a similarity measure for software defect prediction: Frontiers of Information Technology & Electronic Engineering 18, 11: 1744-1753. https://doi.org/10.1631/FITEE.1601322

  6. Shi, Yuang, Chen Zu, Mei Hong, Luping Zhou, Lei Wang, Xi Wu, Jiliu Zhou, Daoqiang Zhang, and Yan Wang. (2022) ASMFS: Adaptive-similarity-based multi-modality feature selection for classification of Alzheimer’s disease: Pattern Recognition 126: 108566. https://doi.org/10.1016/j.patcog.2022.108566

  7. Fu, Xuezheng, Feng Tan, Hao Wang, Yanqing Zhang, and Robert W. Harrison. (2006) Feature Similarity Based Redundancy Reduction for Gene Selection: In Dmin, pp. 357-360.

  8. Vabalas, Andrius, Emma Gowen, Ellen Poliakoff, and Alexander J. Casson. (2019) Machine learning algorithm validation with a limited sample size: PloS one 14, 11: e0224365.

  9. Perry, George LW, and Mark E. Dickson. (2018) Using machine learning to predict geomorphic disturbance: The effects of sample size, sample prevalence, and sampling strategy: Journal of Geophysical Research: Earth Surface 123(11): 2954-2970. https://doi.org/10.1029/2018JF004640

  10. Cui, Zaixu, and Gaolang Gong. (2018) The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features: Neuroimage 178: 622-637. https://doi.org/10.1016/j.neuroimage.2018.06.001

  11. Kuncheva, Ludmila I., Clare E. Matthews, Alvar Arnaiz-González, and Juan J. Rodríguez. (2020) Feature selection from high-dimensional data with very low sample size: A cautionary tale. arXiv preprint arXiv:2008.12025.

  12. Kuncheva, Ludmila I., and Juan J. Rodríguez. (2018) On feature selection protocols for very low-sample-size data: Pattern Recognition, 81: 660-673. https://doi.org/10.1016/j.patcog.2018.03.012

  13. J. Doak, (1992) An evaluation of feature selection methods and their application to computer security: Technical Report CSE-92-18.

  14. H. Liu and L. Yu, (2005) Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on knowledge and data engineering. 17:491-502. 10.1109/TKDE.2005.66

  15. C. F. Tsai and Y. T. Sung, (2020) Ensemble feature selection in high dimension, low sample size datasets: Parallel and serial combination approaches, Knowledge-Based Systems. 203. https://doi.org/10.1016/j.knosys.2020.106097

  16. U. Mori, A. Mendiburu and J. A. Lozano, (2015) Similarity measure selection for clustering time series databases: IEEE Transactions on Knowledge and Data Engineering, 28:181-195. 10.1109/TKDE.2015.2462369

  17. Goldani, Mehdi. (2022) A review of time series similarity methods: The Third International Conference on Innovation in Business Management and Economics.

  18. Palkhiwala, S., Shah, M. & Shah, M. (2022) Analysis of Machine learning algorithms for predicting a student’s grade: J. of Data, Inf. and Manag. 4: 329–341. https://doi.org/10.1007/s42488-022-00078-2

  19. A. C. Rencher and W. F. Christensen, (2012) Methods of Multivariate Analysis: Wiley Series in Probability and Statistics, John Wiley & Sons. p. 19.

This paper is available on arXiv under a CC BY-SA 4.0 DEED (Attribution-ShareAlike 4.0 International) license.
