Make Big Data More Manageable with Smart Sampling | HackerNoon

News Room
Last updated: 2025/02/21 at 8:12 PM
News Room Published 21 February 2025

Authors:

(1) Andrew Draganov, Aarhus University (all authors contributed equally to this research);

(2) David Saulpic, Université Paris Cité & CNRS;

(3) Chris Schwiegelshohn, Aarhus University.

Table of Links

Abstract and 1 Introduction

2 Preliminaries and Related Work

2.1 On Sampling Strategies

2.2 Other Coreset Strategies

2.3 Coresets for Database Applications

2.4 Quadtree Embeddings

3 Fast-Coresets

4 Reducing the Impact of the Spread

4.1 Computing a crude upper-bound

4.2 From Approximate Solution to Reduced Spread

5 Fast Compression in Practice

5.1 Goal and Scope of the Empirical Analysis

5.2 Experimental Setup

5.3 Evaluating Sampling Strategies

5.4 Streaming Setting and 5.5 Takeaways

6 Conclusion

7 Acknowledgements

8 Proofs, Pseudo-Code, and Extensions and 8.1 Proof of Corollary 3.2

8.2 Reduction of k-means to k-median

8.3 Estimation of the Optimal Cost in a Tree

8.4 Extensions to Algorithm 1

References

6 Conclusion

In this work, we discussed the theoretical and practical limits of compression algorithms for center-based clustering. We proposed the first nearly-linear time coreset algorithm for k-median and k-means. Moreover, the algorithm can be parameterized to achieve an asymptotically optimal coreset size. We then conducted a thorough experimental analysis comparing this algorithm with fast sampling heuristics. We found that although the Fast-Coreset algorithm achieves the best compression guarantees among its competitors, naive uniform sampling is already a sufficient compression for downstream clustering tasks on well-behaved datasets. Furthermore, we found that intermediate heuristics interpolating between uniform sampling and coresets play an important role in balancing efficiency and accuracy.
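To make the uniform-sampling observation concrete, here is a minimal sketch, not the authors' code or experimental setup. It assumes a hypothetical toy dataset of three balanced Gaussian blobs, compresses it by uniform sampling with each sampled point weighted by n/m, and compares the weighted k-means cost on the sample against the cost on the full data for one fixed candidate solution. The function names, the data, and the ratio `distortion` are illustrative assumptions. A true coreset guarantee must hold for every candidate solution, not just the one checked here.

```python
import random

def uniform_sample(points, m, rng):
    """Draw m points uniformly without replacement; weight each by n/m."""
    n = len(points)
    sample = rng.sample(points, m)
    weights = [n / m] * m  # each sampled point stands in for n/m originals
    return sample, weights

def kmeans_cost(points, centers, weights=None):
    """(Weighted) sum of squared distances to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        total += w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

rng = random.Random(0)
# Hypothetical well-behaved data: three balanced Gaussian blobs in the plane.
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
          for cx, cy in blob_centers for _ in range(2000)]

sample, weights = uniform_sample(points, m=300, rng=rng)
full_cost = kmeans_cost(points, blob_centers)
sample_cost = kmeans_cost(sample, blob_centers, weights)
# Ratio >= 1; values near 1 mean the sample preserves the cost for these centers.
distortion = max(full_cost / sample_cost, sample_cost / full_cost)
print(f"full={full_cost:.1f} sample={sample_cost:.1f} distortion={distortion:.3f}")
```

On balanced data like this, the ratio should stay close to 1. If the dataset instead contained a tiny, far-away cluster, a uniform sample could miss it entirely and badly underestimate the cost, which is the kind of case where importance-based sampling earns its extra running time.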

Although this closes the door on the highly studied problem of optimally small and fast coresets for k-median and k-means, open questions of wider scope remain. For example, when does sensitivity sampling guarantee accurate compression with optimal space in linear time, and can these conditions be formalized? Furthermore, sensitivity sampling is incompatible with paradigms such as fair clustering [8, 15, 21, 43, 56], and it is unclear whether a linear-time method can optimally compress a dataset while adhering to the fairness constraints.

7 Acknowledgements

Andrew Draganov and Chris Schwiegelshohn are partially supported by the Independent Research Fund Denmark (DFF) under a Sapere Aude Research Leader grant No 1051-00106B. David Saulpic has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 101034413.
