Which Feature Selection Method Performs Best with Changing Data Sizes? | HackerNoon

News Room · Published 13 May 2025

Authors:

(1) Mahdi Goldani;

(2) Soraya Asadi Tirvan.

Table of Links

  • Abstract and Introduction
  • Methodology
  • Dataset
  • Similarity methods
  • Feature selection methods
  • Measure the performance of methods
  • Result
  • Discussion
  • Conclusion and References

Discussion

Three main factors determine the best method in this research: the R-squared value, the sensitivity of a method to changes in sample size, and the degree of fluctuation as the sample size changes. Among the methods, the variance, stepwise, and correlation methods had the highest average R-squared values. The mutual information, variance, simulated, edit distance, and Hausdorff methods were the least sensitive to data size. The variance, simulated, and edit distance methods fluctuated the least. By these metrics, the variance method satisfied all three criteria. Among the similarity methods, Hausdorff and edit distance performed well compared to the others.
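As a concrete illustration of the variance criterion discussed above, a variance filter can be sketched as ranking candidate features by sample variance and keeping the top k. This is a minimal sketch, not the authors' exact implementation; the function name and toy data are illustrative:

```python
import numpy as np

def variance_select(X, k):
    """Rank the columns of X by sample variance and keep the top k.

    Returns the indices of the k highest-variance features,
    highest variance first.
    """
    variances = np.var(X, axis=0, ddof=1)      # per-feature sample variance
    return np.argsort(variances)[::-1][:k]     # descending order of variance

# Toy example: three features with increasing spread.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 0.1, 50),   # low variance
    rng.normal(0, 1.0, 50),   # medium variance
    rng.normal(0, 5.0, 50),   # high variance
])
print(variance_select(X, 2))  # keeps columns 2 and 1
```

A practical refinement is to standardize or scale features first when they are measured in different units, since raw variance otherwise favors large-scale variables.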

This study focuses on improving the performance of a data-driven model by selecting the most appropriate features. It is nonetheless important to acknowledge certain limitations. First, many existing hybrid methods remain to be investigated. Second, the results presented here may change as the dataset changes.

Conclusion

The aim of this study is to identify feature selection techniques that are insensitive to small data sizes and that select a subset of features with high predictive performance when data are scarce. As noted in the results section, on the dataset used in this study the performance of standard feature selection methods fluctuates across data volumes, which reduces confidence in these techniques. Among the ten standard feature selection methods, the variance and simulated methods are more stable than the others. The graphs show that the similarity methods introduced as an alternative to standard feature selection are less volatile, which increases confidence in their results when the data size varies. Therefore, under the first approach of this study, the similarity methods are more reliable than the usual feature selection methods. It is essential to note, however, that these results come from a single dataset and may change on other datasets.
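The stability comparison described here can be sketched as refitting a model on progressively larger subsets of the data and recording R-squared at each size; stable methods produce a flat curve. This is a simplified sketch on synthetic data using ordinary least squares, not the paper's exact experimental setup, and the function names are illustrative:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination for predictions y_hat against y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def sensitivity_curve(X, y, sizes):
    """Fit OLS on the first n rows for each n in sizes; record in-sample R²."""
    scores = []
    for n in sizes:
        Xn = np.column_stack([np.ones(n), X[:n]])       # add intercept
        beta, *_ = np.linalg.lstsq(Xn, y[:n], rcond=None)
        scores.append(r_squared(y[:n], Xn @ beta))
    return scores

# Synthetic data: y depends linearly on x with mild noise.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
print(sensitivity_curve(x.reshape(-1, 1), y, [50, 100, 200]))
```

The spread of the resulting scores (e.g. their standard deviation across sizes) is one simple way to quantify the "fluctuation" criterion used to compare methods.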

The second and more critical question the article sought to test is the sensitivity of the methods to changes in data size: whichever method is least sensitive to such changes should be chosen. According to the trendline results, the variance, correlation, and simulated methods, together with edit distance and Hausdorff, are the least sensitive to observation size. Given that the time series similarity methods showed the smallest fluctuation among the feature selection methods, they can serve as reliable feature selection tools. Similarity methods such as the Hausdorff and edit distance approaches emerged as the most stable of the techniques evaluated. This robustness across data sizes underscores their reliability and suitability for this research context: their consistent performance indicates that they can handle fluctuations in the number of observations without significant loss of predictive accuracy. Such resilience is crucial for the robustness and generalizability of predictive models, particularly in dynamic environments such as financial markets. Consequently, these methods stand out as promising tools for feature selection, offering researchers a dependable approach to identifying relevant variables for predictive modeling tasks, such as forecasting Apple’s closing price.
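The similarity-based selection idea can be sketched as ranking candidate features by their Hausdorff distance to the target series, keeping the closest ones. The sketch below treats each 1-D series as a set of values, which is a simplification (Hausdorff distance over time series is often computed on (time, value) point pairs); the feature names are hypothetical and not taken from the paper:

```python
import numpy as np

def hausdorff_1d(a, b):
    """Symmetric Hausdorff distance between two 1-D series viewed as value sets."""
    d = np.abs(a[:, None] - b[None, :])              # pairwise |a_i - b_j|
    return max(d.min(axis=1).max(),                  # directed distance a -> b
               d.min(axis=0).max())                  # directed distance b -> a

def rank_by_similarity(features, target):
    """Order candidate features by Hausdorff distance to the target, closest first."""
    dists = {name: hausdorff_1d(series, target) for name, series in features.items()}
    return sorted(dists, key=dists.get)

target = np.array([1.0, 2.0, 3.0, 4.0])
features = {
    "close_lag1": np.array([1.1, 2.1, 2.9, 4.2]),    # tracks the target closely
    "volume":     np.array([10.0, 9.0, 12.0, 11.0]), # far from the target
}
print(rank_by_similarity(features, target))  # → ['close_lag1', 'volume']
```

In practice, series should be normalized before computing distances so that scale differences between features do not dominate the ranking.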

References

  1. E. W. Newell and Y. Cheng, (2016) Mass cytometry: blessed with the curse of dimensionality: Nature Immunology, 17: 890-895. https://doi.org/10.1038/ni.3485

  2. B. Remeseiro and V. Bolon-Canedo, (2019) A review of feature selection methods in medical applications: Computers in biology and medicine. 112. https://doi.org/10.1016/j.compbiomed.2019.103375

  3. Zhu X, Wang Y, Li Y, Tan Y, Wang G, Song Q. (2019) A new unsupervised feature selection algorithm using similarity-based feature clustering: Comput Intell. 35(1): 2-22. doi:10.1111/COIN.12192

  4. Mitra, Pabitra, C. A. Murthy, and Sankar K. Pal. (2002) Unsupervised feature selection using feature similarity: IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 3: 301-312. DOI: 10.1109/34.990133

  5. Yu, Qiao, Shu-juan Jiang, Rong-cun Wang, and Hong-yang Wang. (2017) A feature selection approach based on a similarity measure for software defect prediction: Frontiers of Information Technology & Electronic Engineering 18, 11: 1744-1753. https://doi.org/10.1631/FITEE.1601322

  6. Shi, Yuang, Chen Zu, Mei Hong, Luping Zhou, Lei Wang, Xi Wu, Jiliu Zhou, Daoqiang Zhang, and Yan Wang. (2022) ASMFS: Adaptive-similarity-based multi-modality feature selection for classification of Alzheimer’s disease: Pattern Recognition 126: 108566. https://doi.org/10.1016/j.patcog.2022.108566

  7. Fu, Xuezheng, Feng Tan, Hao Wang, Yanqing Zhang, and Robert W. Harrison. (2006) Feature Similarity Based Redundancy Reduction for Gene Selection: In Dmin, pp. 357-360.

  8. Vabalas, Andrius, Emma Gowen, Ellen Poliakoff, and Alexander J. Casson. (2019) Machine learning algorithm validation with a limited sample size: PloS one 14, 11: e0224365.

  9. Perry, George LW, and Mark E. Dickson. (2018) Using machine learning to predict geomorphic disturbance: The effects of sample size, sample prevalence, and sampling strategy: Journal of Geophysical Research: Earth Surface 123(11): 2954-2970. https://doi.org/10.1029/2018JF004640

  10. Cui, Zaixu, and Gaolang Gong. (2018) The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features: Neuroimage 178: 622-637. https://doi.org/10.1016/j.neuroimage.2018.06.001

  11. Kuncheva, Ludmila I., Clare E. Matthews, Alvar Arnaiz-González, and Juan J. Rodríguez. (2020) Feature selection from high-dimensional data with very low sample size: A cautionary tale. arXiv preprint arXiv:2008.12025.

  12. Kuncheva, Ludmila I., and Juan J. Rodríguez. (2018) On feature selection protocols for very low-sample-size data: Pattern Recognition, 81: 660-673. https://doi.org/10.1016/j.patcog.2018.03.012

  13. J. Doak, (1992) An evaluation of feature selection methods and their application to computer security: Technical Report CSE-92-18.

  14. H. Liu and L. Yu, (2005) Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on knowledge and data engineering. 17:491-502. 10.1109/TKDE.2005.66

  15. C. F. Tsai and Y. T. Sung, (2020) Ensemble feature selection in high dimension, low sample size datasets: Parallel and serial combination approaches, Knowledge-Based Systems. 203. https://doi.org/10.1016/j.knosys.2020.106097

  16. U. Mori, A. Mendiburu and J. A. Lozano, (2015) Similarity measure selection for clustering time series databases: IEEE Transactions on Knowledge and Data Engineering, 28:181-195. 10.1109/TKDE.2015.2462369

  17. Goldani, Mehdi. (2022) A review of time series similarity methods: The Third International Conference on Innovation in Business Management and Economics.

  18. Palkhiwala, S., Shah, M. & Shah, M. (2022) Analysis of Machine learning algorithms for predicting a student’s grade: J. of Data, Inf. and Manag. 4: 329–341. https://doi.org/10.1007/s42488-022-00078-2

  19. A. C. Rencher and W. F. Christensen, (2012) Methods of Multivariate Analysis: Wiley Series in Probability and Statistics, John Wiley & Sons. p. 19.

This paper is available on arXiv under a CC BY-SA 4.0 DEED (Attribution-ShareAlike 4.0 International) license.
