Authors:
(1) Andrew Draganov, Aarhus University (all authors contributed equally to this research);
(2) David Saulpic, Université Paris Cité & CNRS;
(3) Chris Schwiegelshohn, Aarhus University.
Table of Links
Abstract and 1 Introduction
2 Preliminaries and Related Work
2.1 On Sampling Strategies
2.2 Other Coreset Strategies
2.3 Coresets for Database Applications
2.4 Quadtree Embeddings
3 Fast-Coresets
4 Reducing the Impact of the Spread
4.1 Computing a crude upper-bound
4.2 From Approximate Solution to Reduced Spread
5 Fast Compression in Practice
5.1 Goal and Scope of the Empirical Analysis
5.2 Experimental Setup
5.3 Evaluating Sampling Strategies
5.4 Streaming Setting and 5.5 Takeaways
6 Conclusion
7 Acknowledgements
8 Proofs, Pseudo-Code, and Extensions and 8.1 Proof of Corollary 3.2
8.2 Reduction of k-means to k-median
8.3 Estimation of the Optimal Cost in a Tree
8.4 Extensions to Algorithm 1
References
5 Fast Compression in Practice
Although Sections 3 and 4 describe a near-linear-time algorithm, the coreset construction of Algorithm 1 still requires a bounded approximation to the clustering task before the sampling can occur. This step is theoretically justified, but it is unclear how necessary it is in practice: would a method that cuts corners still obtain good practical compression?
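To make the prerequisite concrete, the following minimal Python sketch performs sensitivity sampling given such an approximate solution: each point is sampled proportionally to the standard upper bound on its sensitivity (its relative cost plus the inverse size of its cluster) and reweighted so that the cost estimate stays unbiased. The function name and the simplified constants are our own illustration, not the paper's exact pseudo-code.

```python
import numpy as np

def sensitivity_sample(X, centers, m, seed=0):
    """Sample an m-point weighted coreset from X (shape (n, d)),
    given an approximate k-means solution `centers` (shape (k, d))."""
    rng = np.random.default_rng(seed)
    # Brute-force squared distances to every center: shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)                # closest center per point
    cost = d2[np.arange(len(X)), assign]      # each point's cost
    sizes = np.bincount(assign, minlength=len(centers))
    # Sensitivity upper bound (constants simplified for illustration):
    # relative cost of the point plus the inverse size of its cluster.
    s = cost / max(cost.sum(), 1e-12) + 1.0 / sizes[assign]
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])              # keep cost estimates unbiased
    return X[idx], weights
```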
5.1 Goal and Scope of the Empirical Analysis
To motivate the upcoming experiments, we begin by asking “how do other sampling strategies compare to standard sensitivity sampling?” For this preliminary experiment, we focus on uniform sampling and the Fast-Coreset algorithm (Algorithm 1), evaluating each method's distortion across the following real datasets: Adult [32], MNIST [48], Taxi [52], Star [55], Song [12], Census [53], and Cover Type [13]. The dataset characteristics are summarized in Table 3.
The resulting comparisons can be found in Table 2. Since sensitivity sampling is the recommended coreset method [57], we use it to obtain a baseline distortion for each dataset[8]. We then compare uniform sampling and Fast-Coresets against this baseline by reporting the ratio of each method's distortion to that of sensitivity sampling. As guaranteed by Section 4, Fast-Coresets achieve distortions consistent with standard sensitivity sampling. The picture is more subtle for uniform sampling: it performs well on most real-world datasets but fails catastrophically on a few (for example, it is ∼ 600× worse than standard sensitivity sampling on the Taxi dataset).
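Since Table 2 reports these ratios, the sketch below shows one common way to estimate a coreset's distortion empirically: solve k-means on the compressed data, evaluate the resulting centers on the full dataset, and normalize by the cost of a solution computed directly on the full data. We use scikit-learn here for brevity; this is a proxy for the paper's evaluation protocol, not a reproduction of it.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(X, centers, w=None):
    """(Weighted) k-means cost of `centers` on the points X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
    return d2.sum() if w is None else (w * d2).sum()

def distortion(X, coreset, weights, k, seed=0):
    """Cost on the full data of the solution found on the coreset,
    relative to a solution computed directly on the full data."""
    full = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    small = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(
        coreset, sample_weight=weights)
    return (kmeans_cost(X, small.cluster_centers_)
            / kmeans_cost(X, full.cluster_centers_))
```

Dividing one method's distortion by the sensitivity-sampling baseline then yields a ratio of the kind reported in Table 2.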
This confirms that uniform sampling is not unconditionally reliable as a compression method: although it is fast, it has the potential to miss important data points. On the other end of the runtime-vs.-quality spectrum, Fast-Coresets consistently provide an accurate compression but, despite being the fastest coreset method, are still significantly slower than uniform sampling. The fundamental question is therefore: when is it safe to use fast, inaccurate sampling methods, and when is the full coreset guarantee necessary?
We focus the remainder of the experimental section on this question. Specifically, we define a suite of compression methods that interpolate between uniform sampling and Fast-Coresets and evaluate them across a set of synthetic datasets. This allows us to chart the full spectrum of speed-vs.-accuracy tradeoffs and to identify which dataset characteristics must hold before suboptimal sampling methods can be applied safely.
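As one hypothetical illustration of such an interpolation (the mixing scheme below is our own and not the paper's actual suite of methods), a single parameter can blend the uniform distribution with a sensitivity-based one:

```python
import numpy as np

def interpolated_sample(X, p_sensitivity, m, lam, seed=0):
    """Blend uniform sampling (lam = 0) with sensitivity sampling
    (lam = 1). `p_sensitivity` is a normalized sensitivity
    distribution, e.g. the vector p from the earlier sketch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    p = (1.0 - lam) / n + lam * p_sensitivity  # still sums to 1
    idx = rng.choice(n, size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])               # unbiased reweighting
    return X[idx], weights
```

Intermediate values of lam trade compression quality for the speed of computing, and sampling from, a distribution closer to uniform.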
We complement this with a comprehensive study of related topics, such as additional analysis on real datasets, comparing against other state-of-the-art coreset methods, and verifying that these results extend to the streaming setting. Unless stated otherwise, our experimental results focus on the k-means task.