Authors:
(1) Andrew Draganov, Aarhus University (all authors contributed equally to this research);
(2) David Saulpic, Université Paris Cité & CNRS;
(3) Chris Schwiegelshohn, Aarhus University.
Table of Links
Abstract and 1 Introduction
2 Preliminaries and Related Work
2.1 On Sampling Strategies
2.2 Other Coreset Strategies
2.3 Coresets for Database Applications
2.4 Quadtree Embeddings
3 Fast-Coresets
4 Reducing the Impact of the Spread
4.1 Computing a crude upper-bound
4.2 From Approximate Solution to Reduced Spread
5 Fast Compression in Practice
5.1 Goal and Scope of the Empirical Analysis
5.2 Experimental Setup
5.3 Evaluating Sampling Strategies
5.4 Streaming Setting and 5.5 Takeaways
6 Conclusion
7 Acknowledgements
8 Proofs, Pseudo-Code, and Extensions and 8.1 Proof of Corollary 3.2
8.2 Reduction of k-means to k-median
8.3 Estimation of the Optimal Cost in a Tree
8.4 Extensions to Algorithm 1
References
5 Fast Compression in Practice
Although Sections 3 and 4 describe a near-linear-time algorithm, the coreset construction of Algorithm 1 still requires a bounded approximation to the clustering task before the sampling can occur. This step is theoretically justified, but it is unclear how necessary it is in practice: would a method that cuts corners still obtain good practical compression?
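To make the prerequisite concrete, the following minimal Python sketch performs sensitivity sampling given such an approximate solution: each point is sampled proportionally to the standard upper bound on its sensitivity (its relative cost plus the inverse size of its cluster) and reweighted so that the cost estimate stays unbiased. The function name and the simplified constants are our own illustration, not the paper's exact pseudo-code.

```python
import numpy as np

def sensitivity_sample(X, centers, m, seed=0):
    """Sample an m-point weighted coreset from X (shape (n, d)),
    given an approximate k-means solution `centers` (shape (k, d))."""
    rng = np.random.default_rng(seed)
    # Brute-force squared distances to every center: shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)                # closest center per point
    cost = d2[np.arange(len(X)), assign]      # each point's cost
    sizes = np.bincount(assign, minlength=len(centers))
    # Sensitivity upper bound (constants simplified for illustration):
    # relative cost of the point plus the inverse size of its cluster.
    s = cost / max(cost.sum(), 1e-12) + 1.0 / sizes[assign]
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])              # keep cost estimates unbiased
    return X[idx], weights
```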
5.1 Goal and Scope of the Empirical Analysis
To motivate the upcoming experiments, we begin by asking “how do other sampling strategies compare to standard sensitivity sampling?” For this preliminary experiment, we focus on uniform sampling and the Fast-Coreset algorithm (Algorithm 1), evaluating each method's distortion across the following real datasets: Adult [32], MNIST [48], Taxi [52], Star [55], Song [12], Census [53], and Cover Type [13]. The dataset characteristics are summarized in Table 3.
The resulting comparisons can be found in Table 2. Since sensitivity sampling is the recommended coreset method [57], we use it to obtain a baseline distortion for each dataset[8]. We then compare uniform sampling and Fast-Coresets against this baseline by reporting the ratio of each method's distortion to that of sensitivity sampling. As guaranteed by Section 4, Fast-Coresets achieve distortions consistent with standard sensitivity sampling. The picture is more subtle for uniform sampling: it performs well on most real-world datasets but fails catastrophically on a few (for example, it is ∼ 600× worse than standard sensitivity sampling on the Taxi dataset).
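Since Table 2 reports these ratios, the sketch below shows one common way to estimate a coreset's distortion empirically: solve k-means on the compressed data, evaluate the resulting centers on the full dataset, and normalize by the cost of a solution computed directly on the full data. We use scikit-learn here for brevity; this is a proxy for the paper's evaluation protocol, not a reproduction of it.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(X, centers, w=None):
    """(Weighted) k-means cost of `centers` on the points X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
    return d2.sum() if w is None else (w * d2).sum()

def distortion(X, coreset, weights, k, seed=0):
    """Cost on the full data of the solution found on the coreset,
    relative to a solution computed directly on the full data."""
    full = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    small = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(
        coreset, sample_weight=weights)
    return (kmeans_cost(X, small.cluster_centers_)
            / kmeans_cost(X, full.cluster_centers_))
```

Dividing one method's distortion by the sensitivity-sampling baseline then yields a ratio of the kind reported in Table 2.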
This confirms that uniform sampling is not unconditionally reliable as a compression method: although it is fast, it has the potential to miss important data points. On the other end of the runtime-vs.-quality spectrum, Fast-Coresets consistently provide an accurate compression but, despite being the fastest coreset method, are still significantly slower than uniform sampling. The fundamental question is therefore: when is it safe to use fast, inaccurate sampling methods, and when is the full coreset guarantee necessary?
We focus the remainder of the experimental section on this question. Specifically, we define a suite of compression methods that interpolate between uniform sampling and Fast-Coresets and evaluate them across a set of synthetic datasets. This allows us to chart the full spectrum of speed-vs.-accuracy tradeoffs and to identify which dataset characteristics must hold before suboptimal sampling methods can be applied safely.
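As one hypothetical illustration of such an interpolation (the mixing scheme below is our own and not the paper's actual suite of methods), a single parameter can blend the uniform distribution with a sensitivity-based one:

```python
import numpy as np

def interpolated_sample(X, p_sensitivity, m, lam, seed=0):
    """Blend uniform sampling (lam = 0) with sensitivity sampling
    (lam = 1). `p_sensitivity` is a normalized sensitivity
    distribution, e.g. the vector p from the earlier sketch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    p = (1.0 - lam) / n + lam * p_sensitivity  # still sums to 1
    idx = rng.choice(n, size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])               # unbiased reweighting
    return X[idx], weights
```

Intermediate values of lam trade compression quality for the speed of computing, and sampling from, a distribution closer to uniform.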
We complement this with a comprehensive study of related topics, such as additional analysis on real datasets, comparing against other state-of-the-art coreset methods, and verifying that these results extend to the streaming setting. Unless stated otherwise, our experimental results focus on the k-means task.