:::info
Authors:
(1) Ahatsham Hayat, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln ([email protected]);
(2) Mohammad Rashedul Hasan, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln ([email protected]).
:::
Table of Links
Abstract and 1 Introduction
2 Method
2.1 Problem Formulation and 2.2 Missingness Patterns
2.3 Generating Missing Values
2.4 Description of CLAIM
3 Experiments
3.1 Results
4 Related Work
5 Conclusion and Future Directions
6 Limitations and References
3 Experiments
We conducted a series of experiments to systematically evaluate the efficacy of CLAIM in addressing the research questions presented in Section 1. Our validation criterion for CLAIM’s effectiveness was the post-imputation performance of pre-trained LLMs fine-tuned on missing-aware contextual datasets for downstream classification tasks. We focused on three types of missingness mechanisms: MCAR, MAR, and MNAR.
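The paper’s actual procedure for generating missing values is described in Section 2.3; the sketch below is only an illustrative way to simulate the three mechanisms on a numeric table, with the masking rates and driver-column choices being our own assumptions rather than the authors’ setup.

```python
import numpy as np

def mcar_mask(X, rate=0.2, rng=None):
    """MCAR: every cell is masked independently with the same probability."""
    rng = rng or np.random.default_rng(0)
    return rng.random(X.shape) < rate

def mar_mask(X, rate=0.2, driver_col=0, rng=None):
    """MAR: missingness in other columns depends on an always-observed column."""
    rng = rng or np.random.default_rng(0)
    # Rows with a high value in the driver column are more likely to lose other features.
    p = rate * 2 * (X[:, driver_col] > np.median(X[:, driver_col]))[:, None]
    mask = rng.random(X.shape) < p
    mask[:, driver_col] = False  # the driver column stays fully observed
    return mask

def mnar_mask(X, rate=0.2, rng=None):
    """MNAR: a cell's own (unobserved) value drives its probability of being missing."""
    rng = rng or np.random.default_rng(0)
    p = rate * 2 * (X > np.median(X, axis=0))  # larger values are more often missing
    return rng.random(X.shape) < p

# Example: apply an MCAR mask to a synthetic table.
X = np.random.default_rng(1).normal(size=(100, 5))
X_mcar = np.where(mcar_mask(X), np.nan, X)
```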
Datasets. We evaluated the performance of CLAIM using seven real-life multivariate classification datasets from the UCI repository [12]. Detailed information on these datasets is provided in the Appendix.
Baseline Imputation Methods. Our approach was compared against a broad spectrum of commonly-used baseline imputation methods, encompassing single imputation (SI) and multiple imputation (MI) techniques, non-ML and ML methods, and both discriminative and generative ML approaches.
SI methods included mean imputation using the feature-wise mean (non-ML), k-Nearest Neighbors (k-NN) [3] (ML: Discriminative), the tree-based MissForest algorithm [37] (ML: Discriminative), and GAIN (Generative Adversarial Imputation Nets) [45], a deep generative adversarial network for imputation (ML: Generative). The MI method employed was MICE (Multiple Imputation by Chained Equations) [22] (ML: Discriminative).
Experimental Settings. The hyperparameter settings for the various imputation methods and the LLM used in our experiments are detailed below.
Hyperparameters for Baseline Imputation Methods. For GAIN, we adhered to the hyperparameters specified in the original publication, setting α to 100, the batch size to 128, the hint rate to 0.9, and the number of iterations to 1000 for optimal performance. MissForest and MICE were configured with their respective default parameters as provided in their PyPI implementations[2]. For k-NN, we chose k = 5 and the Euclidean distance measure, based on literature suggesting this configuration offers superior performance [15].
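As a rough illustration of how these baselines could be configured in code: the paper uses the PyPI implementations of MissForest and MICE and the reference implementation of GAIN, so the scikit-learn-based sketch below is an approximation of that setup, not the exact one.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Small toy table with np.nan marking missing cells.
X_missing = np.array([[1.0, 2.0, np.nan],
                      [4.0, np.nan, 6.0],
                      [7.0, 8.0, 9.0],
                      [np.nan, 5.0, 3.0]])

imputers = {
    # Non-ML single imputation: feature-wise mean.
    "mean": SimpleImputer(strategy="mean"),
    # Discriminative single imputation: k-NN with k = 5 and (nan-aware) Euclidean distance.
    "knn": KNNImputer(n_neighbors=5, metric="nan_euclidean"),
    # MICE-style multiple imputation by chained equations, default settings.
    "mice": IterativeImputer(random_state=0),
    # MissForest-style imputation: chained equations with random-forest estimators.
    "missforest": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=0),
        random_state=0),
    # GAIN has no scikit-learn counterpart; per the paper it is run with its reference
    # implementation using alpha=100, batch_size=128, hint_rate=0.9, iterations=1000.
}

imputed = {name: imp.fit_transform(X_missing) for name, imp in imputers.items()}
```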
Pre-trained LLM. We utilized the 7 billion-parameter LLaMA 2 model [40], fine-tuning it with the parameter-efficient QLoRA method [11]. The settings were r = 16, α = 64, and dropout = 0.1, with the task type set to “CAUSAL_LM”. The learning rate was 2e-4, using the “paged_adamw_32bit” optimizer.
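A minimal sketch of such a QLoRA setup with the Hugging Face transformers and peft libraries is shown below. The LoRA hyperparameters follow the values reported above, while the checkpoint identifier and the 4-bit quantization details are our assumptions, since the paper does not spell them out.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization in the spirit of QLoRA (exact quantization settings are assumed).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # 7B LLaMA 2; the exact checkpoint id is assumed
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapter settings reported in the paper.
lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```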
Experiments were conducted with a batch size of 4 across 50 epochs, considering memory constraints during fine-tuning. Tesla A40 GPUs (48 GB RAM) were used for distributed training. For evaluation, we used a randomly sampled 20% of the instances of each dataset. Each model was evaluated five times, and we report the average performance and standard deviation.
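The corresponding training configuration might look roughly as follows; the output directory and the evaluate_once stub are placeholders for illustration, not part of the paper.

```python
import numpy as np
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="claim-llama2-7b",      # placeholder output directory
    per_device_train_batch_size=4,     # batch size of 4 due to memory constraints
    num_train_epochs=50,
    learning_rate=2e-4,
    optim="paged_adamw_32bit",
)

def evaluate_once(run_id: int) -> float:
    """Placeholder for one evaluation run on the held-out 20% split."""
    ...  # run inference with the fine-tuned model and compute accuracy
    return 0.0

# Five evaluation runs; report mean accuracy and its standard deviation.
accuracies = [evaluate_once(i) for i in range(5)]
print(f"accuracy: {np.mean(accuracies):.4f} ± {np.std(accuracies):.4f}")
```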
3.1 Results
Figure 2 displays the experimental outcomes for seven datasets, where we benchmarked CLAIM against existing imputation methods. Performance metrics for LLMs fine-tuned on fully complete datasets (without any missing values, so no imputation was necessary) are included for comparison. These serve as a reference baseline, offering a clearer perspective on the benefits CLAIM provides over traditional imputation methods.
[RQ1]: How effective is CLAIM in imputing missing values across the distinct missingness mechanisms (MCAR, MAR, and MNAR) and how does it compare with existing imputation methods in terms of accuracy and robustness across varied datasets and missing data scenarios?
MCAR: CLAIM demonstrated superior accuracy in imputing missing values across all datasets compared to baseline imputation methods. Its performance under the MCAR assumption, where missingness is independent of any data, suggests that CLAIM efficiently leverages the contextual information inherent in the dataset for imputation. This is particularly evident in its ability to significantly close the gap to the performance of fully complete datasets (no imputation), showcasing its effectiveness.
MAR: Under MAR, where missingness depends on observed data, the adaptability of CLAIM is further highlighted. It outperforms other methods by a considerable margin, indicating its proficiency in utilizing available data points to predict missing values accurately.
MNAR: The MNAR scenario, characterized by missingness that depends on unobserved data, poses the greatest challenge. Here, CLAIM’s performance remains notably superior to traditional imputation methods. This robustness in the face of the most difficult missingness mechanism illustrates CLAIM’s potential to effectively mitigate the biases introduced by MNAR missingness, utilizing the LLaMA 2 7B model’s capacity to infer missing information from complex patterns.
To elucidate the superior performance of CLAIM over traditional baseline imputation methods, we delved into its performance on three particularly challenging datasets: Glass Identification, Seeds, and Wine. These datasets were selected due to the relatively lower performance exhibited by the LLM when utilizing fully complete versions of the datasets, highlighting their complexity and the rigorous testing ground they provide for evaluating CLAIM’s effectiveness.
Table 1 presents a detailed comparative analysis. For the Glass Identification dataset, where the LLM achieved an accuracy of only 69.40% with the full dataset, CLAIM demonstrated a significant advantage. It outperformed the best baseline method (k-NN, which achieved 52.40% accuracy) by a substantial margin of 7.2%. This performance gap underscores CLAIM’s robustness and its ability to effectively handle missing data within complex datasets.
The challenge escalates with the Seeds dataset, wherein CLAIM surpassed the top-performing baseline method (MICE) by a margin of 4.2%. This further exemplifies CLAIM’s superiority in managing missing data, even in datasets where the LLM’s base performance is less than optimal.
The Wine dataset showcased a similar trend, with CLAIM exceeding the best baseline performance by a margin of 2.4%. It’s noteworthy that the performance gaps between CLAIM and the best-performing baseline methods are relatively modest under MAR conditions—2%, 3%, and 1.2% for Glass Identification, Seeds, and Wine, respectively. This observation suggests that while the predictability of missingness from observed data in MAR scenarios offers some leverage for traditional imputation methods, CLAIM still maintains a performance edge.
The MNAR scenario, characterized by the most complex pattern of missingness, highlighted CLAIM’s distinct advantage. Across all three datasets, CLAIM not only outperformed the best baseline methods but did so with remarkable performance gains of 12.4%, 7.6%, and 10% for Glass Identification, Seeds, and Wine, respectively. This substantial improvement underlines CLAIM’s adeptness at navigating the intricacies of MNAR missingness, further cementing its status as a highly effective tool for handling diverse missing data scenarios.
Discussion on RQ1. CLAIM’s superior accuracy across diverse missingness patterns and datasets affirms its effectiveness in a variety of challenging scenarios, thereby addressing RQ1. This consistent outperformance not only underscores its utility but also illustrates the significant benefits of integrating contextualized natural language models into the data imputation process. The pronounced accuracy improvements observed in complex datasets, such as the Glass Identification and Seeds datasets, point to a distinct advantage over traditional imputation techniques, which often falter under such conditions.
The robust performance of CLAIM, evident across MCAR, MAR, and MNAR missingness mechanisms, showcases its broad applicability and dependability. This marks a departure from conventional methods, which might only perform well under limited conditions or with specific types of data [20]. CLAIM’s methodology, which involves verbalizing data and employing contextually relevant descriptors for imputation, ensures its adeptness across various scenarios and data modalities.
Moreover, the minimal variation in CLAIM’s performance across different iterations further underscores its stability and reliability as an imputation method. Such consistency is indispensable for real-world applications, where the quality of imputation directly impacts the efficacy of subsequent data analyses. The ability of CLAIM to maintain a low error margin consistently highlights its potential as a go-to solution for data imputation, offering both precision and reliability.
[RQ2]: How does the choice of phrasing for missingness descriptors in CLAIM affect the performance of LLM-based downstream tasks?
Initially, we utilized contextually relevant descriptors for missing values, leading to unique phrases for different features within a dataset. To address RQ2, we aimed to determine whether using a uniform, yet contextually relevant, descriptor for all features would offer comparable benefits. To this end, we experimented with three consistent descriptors: “NaN”, “Missing value”, and “Value not recorded”. These experiments, focusing on the MCAR scenario, sought to ascertain whether it is more beneficial to use contextually nuanced descriptors or whether a generic descriptor is adequate to harness the LLMs’ general knowledge for managing missing values in datasets.
The experimental findings (Figure 3) illuminate the influence of missing-data phrasing on the effectiveness of LLMs in addressing such situations. The results reveal a distinct pattern: generic descriptors, such as “NaN”, consistently performed worse than context-specific descriptors designed for each feature and dataset. Among the three fixed descriptors tested, there were some variations in performance. Both “NaN” and “Missing value” outperformed “Value not recorded”, with “Missing value” achieving the best results in most cases among the static descriptors.
The superior performance of feature-specific descriptors indicates that LLMs better interpret and manage missing data when it is described in a way that accurately reflects the context of the missing information. For example, a descriptor like “Malic acid quantity missing for this wine sample” allows the LLM to more effectively understand and address the missing data point than a more generic descriptor like “The level of malic acid in the wine is NaN”.
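A simple way to picture this serialization step is sketched below; the per-feature templates are hypothetical and only illustrate the contrast between a context-specific descriptor and a generic one, they are not CLAIM’s actual prompts.

```python
import math

# Hypothetical per-feature templates for the Wine dataset: one phrasing for observed
# values and one context-specific descriptor for missing values.
TEMPLATES = {
    "malic_acid": {
        "observed": "The level of malic acid in the wine is {value}.",
        "missing":  "Malic acid quantity missing for this wine sample.",
    },
    "alcohol": {
        "observed": "The alcohol content of the wine is {value}.",
        "missing":  "Alcohol content not measured for this wine sample.",
    },
}

def verbalize(record: dict, contextual: bool = True) -> str:
    """Turn a (possibly incomplete) record into a natural-language description."""
    sentences = []
    for feature, value in record.items():
        t = TEMPLATES[feature]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            # Context-specific descriptor vs. a generic one such as "NaN".
            sentences.append(t["missing"] if contextual
                             else t["observed"].format(value="NaN"))
        else:
            sentences.append(t["observed"].format(value=value))
    return " ".join(sentences)

sample = {"malic_acid": float("nan"), "alcohol": 13.2}
print(verbalize(sample, contextual=True))   # feature-specific descriptor
print(verbalize(sample, contextual=False))  # generic "NaN" phrasing
```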
Discussion on RQ2. The findings related to RQ2 underscore the importance of context in the interaction between LLMs and missing data. The preference for context-specific descriptors over generic ones likely arises from the LLM’s capacity to utilize its extensive training on diverse language uses and contexts. When missing data is described in a manner that aligns with the specific context of a feature, the LLM is better positioned to apply its vast repository of knowledge to deduce or generate suitable imputations. This effectiveness diminishes with the use of generic labels, which offer minimal contextual information for the LLM to draw upon.
:::info
This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.
:::
[2] https://pypi.org/