Transcript
Vladimir Zakharov: My name is Vlad Zakharov. I have over 25 years of enterprise software development experience, primarily in the Java ecosystem. I also served on the Java Community Process Executive Committee, which is the body that governs the evolution of the Java platform. I was a member of the expert group for JSR 335, which delivered lambda expressions and the Stream API, and went live with Java 8. Today’s talk is about DataFrames in Java, but it will also compare and contrast the Java implementations with Python and other DataFrame implementations. We’ll specifically be talking about DataFrames in the context of data-oriented programming, which is where they are probably the most fitting tool in your overall data manipulation toolkit.
Data-Oriented Programming
Before we go and talk about DataFrames proper, let’s have a brief overview of what we mean by data-oriented programming. I’m not going to do justice to this topic, so if this concept is new to you, I recommend looking at the articles and blogs that you see on the slide. Just to define it, here is a definition given by Brian Goetz, who is the architect for the Java language. Data-oriented programming encourages us to model data as immutable data and keep the code that embodies the business logic of how we act on that data separately. To understand it better, it probably helps to compare it to another popular paradigm, object-oriented programming.
Really, the difference between the two approaches is in how we treat data, or state, and behavior. With the object-oriented approach, we all know about the need to encapsulate, to protect your privates, to model our system as objects responding to messages. We also use ideas like polymorphism, so that we don’t necessarily know which instance of which class will be responding to a message at runtime. That hides not just the data, but also the behavior. The implementation of the behavior is not necessarily known at the time you write your code. With the data-oriented approach, you basically do the opposite. You model your domain as pure state, as a collection of properties. Think of it as a collection of Java records, or a database table, or an Excel table. Maybe not an Excel table, but the idea is that you just have pure data, and you operate on that data by writing functions that have access to the naked data.
That data is obviously best kept immutable, because if you have multiple functions operating on shared state and they can mutate it, bad things will happen. These are very different paradigms. One or the other could be a better choice depending on the context. We’re not going to go too deep into when one works better than the other, but I do mention some of these decision points on the slide, and I again refer you to the blogs and articles to get a more in-depth understanding.
To make it a little bit more concrete, let’s look at this distinction between the object-oriented and data-oriented approaches in a particular domain. This domain is a donut shop, and this shop ships donut orders to customers. We’ll be using this domain throughout the presentation, so this is the introduction. Our main entities are the donut shop itself, which has customers. Customers have a name and an address. Then there is a menu of donuts available, with their name, price, and discount price if you buy a lot of them. Then you have donut orders. The way you represent it in the object-oriented approach is by modeling each one of these entities as a class, potentially hiding implementation details.
Again, you may not even have state on the object. State is just a useful abstraction. When you ask an object for a particular piece of state, like how much is this order, it could be a calculated value. It could be directly stored. It could be coming from some helper object. It’s all completely hidden. With the data-oriented approach, you just have three tables. Again, they could be lists of Java records, or database tables. Then, if you want to implement some behavior, you just write, in this case, a Java function operating on a list of donut orders. This function finds the top three best-selling donuts by quantity. This is basically modern Java code that does data-oriented programming in Java.
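To make that concrete, here is a minimal sketch of what such a function might look like in plain modern Java; the record fields and names are illustrative assumptions, not the code from the slide.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A donut order as pure data; field names are assumptions for illustration
record DonutOrder(String customer, String donut, int quantity) {}

// Top three best-selling donuts by total quantity: a plain function over naked data
static List<String> topThreeDonuts(List<DonutOrder> orders) {
    return orders.stream()
            .collect(Collectors.groupingBy(DonutOrder::donut,
                                           Collectors.summingInt(DonutOrder::quantity)))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(3)
            .map(Map.Entry::getKey)
            .toList();
}
```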
In fact, Java support for data-oriented programming has been evolving. A lot of features have been added; you see some of them on the slide. Records, sealed classes, and pattern matching let you manipulate these types of objects directly. Then, of course, collections, streams, and maps let you do set operations on collections of these entities. Having said that, these features do come at a cost. One is memory overhead. What you see here is the actual overhead, computed using the Java Object Layout (JOL) framework, for each one of these records. This is simply what it costs you to have one instance of each. That’s really not a lot, but it will add up if you have millions or billions of these objects, which happens in some real production environments.
The code readability is fine. It’s certainly better than a bunch of nested for loops and if statements, but it’s still not amazing. You can’t just take one look and immediately understand what’s going on there. Also, when you define Java records, records are still classes. If your domain is not stable, if you have to do a lot of refactoring, maybe because you’re in an early exploratory phase of your development, or maybe because you don’t even own the data schema or the data structures that you work with, you may have to refactor a lot, and that comes at a cost. Of course, one “flexible and powerful way to model pure data in Java is to use maps of maps, potentially of maps”, and so please never do it.
What is a DataFrame?
Now let’s throw a DataFrame into the mix. A DataFrame is just a data structure. It’s a tabular dataset made up of columns of different types, again, similar to a relational table. Here is an example of a DataFrame on the slide, and these are donut orders. Again, nothing special there, it looks like a table. Of course, one question could be: if this looks like a relational table, why don’t I just use a relational database to do operations on these datasets? The answer is, of course, you can, but that requires a database. If you don’t have a database in your pipeline, or the database doesn’t have the data that you want to operate on, then loading it into a database just for the sake of doing some filtering, aggregation, or transformation may be too much in terms of cost: development cost, infrastructure cost, processing time, and so on.
As I alluded to on the previous slide, DataFrames give you the ability to easily transform data. You can think of it conceptually like relational database operations. You can aggregate, you can join, you can filter. Because it operates at a high level, a DataFrame does this in a way that makes it easy for developers. We’ll see code examples, of course. It also makes it easy to separate the data manipulation logic from the data itself, and even externalize that logic and take it out of your applications.
Another thing that DataFrames are good at, and again, we’ll see examples of this, is that they can do a lot of optimizations under the covers. They can use collection frameworks that are much more efficient than your standard JDK collections. They can do things like object pooling to save memory and improve performance without any effort from the developer writing the code. This can be much more efficient compared to the alternatives. If you already have a database with all the data you need, use it, but if not, this is something to consider. This isn’t specific to Java DataFrames. These frameworks have been around for some time, and they’re used in real-world production scenarios for things like data transformation, data enrichment, reconciliation, and many other use cases.
Example – The One Billion Row Challenge (1BRC)
Let’s look at some code. For our first example, we’re going to look at The One Billion Row Challenge problem. This is a challenge that Gunnar Morling came up with, and it ran in January of 2024. The idea was to see how fast a Java program can process and aggregate a large dataset, in this case, one billion rows of temperature measurement values. You have a file with one billion rows. Each row has a weather station, which is just some location somewhere on Earth, and a temperature measurement. What you need to do is calculate the minimum, average, and maximum temperature per weather station. There is just one caveat: the file has one billion rows.
If you haven’t heard of it, I highly recommend checking it out. You can find the challenge and the solutions via the links on the slide. The reason we’re looking at it is not because we have any hope of beating the winners of the challenge, but because it’s basically a classic data-oriented programming problem. It makes no sense to design an elaborate object-oriented domain for this task. We’ll see how well different DataFrames can handle it, and also how easy the code is to write and to read, how concise and expressive it is. Before we dive into that, just a quick overview of the results of the actual challenge. Those results are pretty amazing.
The top three results all finished in about a second and a half. If you look at the top results, they do have a lot in common. They’re all long: many hundreds of lines of well-formatted, well-factored code. They use APIs that developers don’t usually touch in their daily coding, things like the Vector API, other low-level APIs, and some experimental JVM features. Because of that, it’s not immediately obvious how these algorithms work, although some of those solutions are pretty well-commented. Also, it’s not obvious what the code does functionally. Obviously, when you look at it, you know that it solves The One Billion Row Challenge, because that’s what it says.
Given a solution, it would take you some time to figure out what it actually does. People worked on all these top solutions up until the challenge deadline, so the time to market for them was about a month, since the challenge ran for a month. Again, it was a fun challenge, so it all made sense, and it doesn’t mean that there is anything wrong with the code. In a real-life scenario, though, we don’t necessarily want to squeeze every CPU cycle out of our solutions. We want to optimize for things like maintainability, developer time, and time to market.
1BRC – Optimized for Devs
What if we flip our requirements and focus on achieving those metrics instead? Let’s throw in a DataFrame. For this part, we’ll look at four DataFrame implementations. The first one is Python/pandas. It’s what I suspect most people think about when they hear the word DataFrame. It’s here just to establish a benchmark, both in terms of performance and in terms of code readability: again, how easy it is to write, how easy it is to read. We’ll look at DataFrame-EC and Tablesaw. They are pure Java DataFrame implementations. I’m the author of and a committer on DataFrame-EC. I picked Tablesaw because it’s another pure Java implementation. It seems to be actively maintained, and it’s used in various other projects.
The last one we’ll look at is Kotlin DataFrame. Kotlin has its own DataFrame library. It’s not Java, but it runs on the JVM, so it’s an option that is available to folks working on the JVM. People seem to be intrigued by all things Kotlin, so I decided to throw it in there. The file format is what you see here. This is what the actual file looks like, except in the real challenge there are one billion rows in there. The code you’re going to see will work with one billion rows, or however many you have. We break this into three steps: load the data, perform aggregation and sorting, and then print the results on the console.
Then, at the end, we’ll look at a performance and memory usage comparison between all these frameworks. Starting with Python/pandas. The code here is pretty straightforward. Even if you don’t know Python, it’s easy to figure out what’s happening. The read_csv method reads the CSV. FILE is a variable holding the file name, and read_csv has these nice named parameters. You can look at it, and it’s clear what’s happening. We get our DataFrame parsed and loaded. Then the processing: we want a groupBy with aggregation. There are two options here. One does the groupBy and aggregation with sorting built in, and the other sorts separately. We want the results to be in a consistent order, so we will have a sorting step either way.
By default, the groupBy and aggregation do the sorting. I read in the documentation that doing the sorting separately results in faster performance. I haven’t seen the difference, but these are the two options, and I’m showing both on the slide. It’s pretty straightforward what’s happening here. The only thing to call out is that after you do the aggregation, you end up with hierarchical columns, which is a feature of pandas. To print it out, we just flatten the column hierarchy to make it easier to work with. Then it’s your standard Python iteration pattern, the for loop. We iterate through the rows and just print the formatted view of each row. Here we are. Done.
Let’s look at the Java examples. We’ll start with DataFrame-EC. The first thing we do is get the file location. Then we define the schema. Unfortunately, although it’s a controversial topic, we don’t have named parameters in Java, so we use this builder pattern. Again, it’s pretty straightforward, I think, to see what’s happening here. You have two columns, station and temperature. Your separator is a semicolon, and there is no header line, so you specify all of that.
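A rough sketch of that schema definition might look like this; the package and method names are approximations of the DataFrame-EC API, not verbatim from the slide.

```java
import io.github.vmzakharov.ecdataframe.dataset.CsvSchema;
import io.github.vmzakharov.ecdataframe.dsl.value.ValueType;

// Schema built up builder-style: two columns, ';' separator, no header line
// (method names approximate)
CsvSchema schema = new CsvSchema()
        .separator(';')
        .hasHeaderLine(false)
        .addColumn("Station", ValueType.STRING)
        .addColumn("Temperature", ValueType.DOUBLE);
```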
Then you create a dataset, which links that schema to an actual physical file location, and you load it as a DataFrame. Here it is, parsed and loaded. Next, the aggregation, the processing step. Again, that looks pretty straightforward. Think of it as doing a groupBy in the relational world. You call aggregateBy with a list of aggregation functions, where each aggregation function takes the column you want to aggregate and the name of the target column. You have these aggregations, then you group by weather station, and in the next step you sort by weather station. You may be wondering what this immutable business is, and we’ll talk about it.
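A sketch of those two steps might look roughly like this, reusing the schema from the snippet above; again, the method names are approximate rather than the slide’s exact code.

```java
import static io.github.vmzakharov.ecdataframe.dataframe.AggregateFunction.avg;
import static io.github.vmzakharov.ecdataframe.dataframe.AggregateFunction.max;
import static io.github.vmzakharov.ecdataframe.dataframe.AggregateFunction.min;

import io.github.vmzakharov.ecdataframe.dataframe.DataFrame;
import io.github.vmzakharov.ecdataframe.dataset.CsvDataSet;
import org.eclipse.collections.impl.factory.Lists;

// Tie the schema (from the previous snippet) to the physical file and load it
CsvDataSet dataSet = new CsvDataSet("measurements.txt", "Measurements", schema);
DataFrame measurements = dataSet.loadAsDataFrame();

// Each aggregation names the source column and the target column;
// group by station, then sort by station for consistent output
DataFrame summary = measurements
        .aggregateBy(
                Lists.immutable.of(
                        min("Temperature", "MinTemp"),
                        avg("Temperature", "AvgTemp"),
                        max("Temperature", "MaxTemp")),
                Lists.immutable.of("Station"))
        .sortBy(Lists.immutable.of("Station"));
```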
Just to clarify things, DataFrame-EC is based on an open-source library called Eclipse Collections, and Eclipse here is as in the Eclipse Foundation, not the Eclipse IDE. It’s a fully compatible replacement for the JDK collections. For the existing data types, you can just do a drop-in replacement. It offers a lot more collection types, including full support for primitive and immutable collections. DataFrame-EC is based on that, so that’s the EC in the name. Immutable here is just a factory method that creates an immutable list, in this case, of aggregation functions. The output just uses a standard forEach pattern. We iterate over the rows of the DataFrame; forEach takes a lambda where we get passed a row object, and then we just print it. Fairly straightforward and fairly similar to what we saw in Python.
Tablesaw works exactly the same way. You get the file. You define the schema. The DataFrame object in Tablesaw is called Table. The only difference here is that if the file doesn’t have column names, you can’t specify them in your schema, so we have to name our columns after we’ve loaded the file. Aggregation, grouping, and sorting again work the same way. You also don’t get to name the result columns here, so column names like Mean and Max are created for you, but otherwise it’s very similar to what you saw with DataFrame-EC. The output works pretty much exactly the same. If you look at this, you may be wondering, are these frameworks all the same? Are the APIs all the same? The answer is, sometimes. There are instances where there are major differences between the code you write, driven by the frameworks’ APIs and some of their design choices. We’ll see examples of those in a bit.
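For comparison, a rough Tablesaw-flavored sketch of the same pipeline might look like this; it’s a sketch rather than the slide code, and the file name and column names are assumptions.

```java
import static tech.tablesaw.aggregate.AggregateFunctions.max;
import static tech.tablesaw.aggregate.AggregateFunctions.mean;
import static tech.tablesaw.aggregate.AggregateFunctions.min;

import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Table;
import tech.tablesaw.io.csv.CsvReadOptions;

// Load: ';' separator, no header line, explicit column types
Table measurements = Table.read().usingOptions(
        CsvReadOptions.builder("measurements.txt")
                .separator(';')
                .header(false)
                .columnTypes(new ColumnType[] {ColumnType.STRING, ColumnType.DOUBLE})
                .build());

// With no header line the columns get default names, so rename them after loading
measurements.column(0).setName("Station");
measurements.column(1).setName("Temperature");

// Group by station and aggregate; result column names (e.g. "Mean [Temperature]")
// are generated by Tablesaw
Table summary = measurements
        .summarize("Temperature", min, mean, max)
        .by("Station")
        .sortAscendingOn("Station");
```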
Let’s look at Kotlin. The loading bit is pretty standard. The processing is interesting. There are a couple of things I want to call out. First of all, groupBy actually physically groups the data and creates a materialized DataFrame of DataFrames. That has to have some cost memory-wise. Then, when you look at the aggregation, you see this slightly unusual syntax where it says into. That’s actually a Kotlin feature that lets you write internal DSLs like this. into is not part of the language grammar; it’s an infix function, which, in a situation like this, may make the code easier to read. If you’re coming from Java, you obviously haven’t seen infix functions, so it’s somewhat unusual. Then, sortBy works as expected. The iteration, the output using the forEach pattern, is again pretty straightforward. We can skip that and look at the performance, specifically execution time and memory usage.
First, you see that we talked about The One Billion Row Challenge, but this is not The One Billion Row Challenge, this is a 100 million row challenge. The reason we had to introduce that was because Kotlin just didn’t finish The One Billion Row Challenge. We could talk about why that is, but I wanted to have a comparison in the same context, so we created the 100 million row challenge. The way to read this slide is: for each of the frameworks, we have Python, DataFrame-EC, Tablesaw, and Kotlin. The first column is execution time, split between time to load and time to aggregate. The second column, the purple one, is maximum heap size, with the caveat that for Python/pandas there is no heap; it’s the space used by the operating system process. I think we have a pretty good idea about Kotlin.
Let’s look at The One Billion Row Challenge results. A couple of things stand out. In terms of performance as measured by execution time, Python comes out ahead of everyone by a relatively significant margin. It’s really not doing great when it comes to memory usage, though. That’s likely to be a limiting factor when you run on commodity-size hardware. That’s just a feature of pandas: if you want something that’s memory efficient at large data volumes, it’s not going to work for you. For the Java frameworks we have two measurements each, one with 16 gigs of heap allocated, the other with 32. The idea was to compare how the frameworks behave under some level of memory and garbage collection pressure. As for the results, as you can tell, Tablesaw is actually a bit faster when it comes to loading data. When both frameworks are under pressure, they use all the available memory, but when the memory pressure is removed, Tablesaw is actually more memory efficient than DataFrame-EC.
Then an interesting thing happens when it comes to aggregation: Tablesaw does worse than DataFrame-EC by an order of magnitude. I’m not exactly sure why. I haven’t debugged it, but Tablesaw relies on the Apache Commons libraries for its aggregation implementation, and Apache Commons wasn’t designed with a lot of attention paid to efficiency, performance, and speed, so that could be hurting it. Both DataFrame-EC and Tablesaw do much better than Kotlin in general, because they could actually finish, but also if you look at loading performance and memory usage, they both rely on highly efficient collection frameworks. For Tablesaw it’s fastutil, and for DataFrame-EC it’s Eclipse Collections, and that really shows here.
About DataFrame-EC
I’m going to talk about DataFrame-EC and what makes it interesting. The first thing is the grammar for the expression DSL, which is an external DSL, just strings, as we’ll see in the examples, that is used for computed columns, filters, and any other kind of data manipulation expression. You can extend the DSL functions and the aggregation functions at runtime without touching the framework. It’s also very flexible when it comes to handling nulls, and you can add custom behavior when you want to handle nulls in certain contexts in a special way. This is driven by the needs of legacy enterprise data flows, where nulls can be interpreted in a number of weird and wonderful ways, and not what we would like to think null means from a pure computer science perspective. Like I mentioned, it’s built on the Eclipse Collections framework and benefits from the underlying efficiencies. It does expose, as you saw, Eclipse Collections types in its API, but then again, that’s a hard dependency, so there is no need to try and hide it.
Eclipse Collections
This slide is more about Eclipse Collections, and it tells you what it adds on top of your standard JDK collections. I ticked off the features that DataFrame-EC takes advantage of. Let’s see how DataFrame-EC, and the same applies to Tablesaw, actually works. If we look again at that DataFrame that we saw earlier, this is our donut order DataFrame, and it looks like a table. If we peel off the cover, we’ll see that there is actually a lot happening under the covers. We use primitive collections, primitive lists, and here you see examples of int and double lists used to store primitive values. There is a pool type for pooling objects, so all strings and dates get pooled, which you actually benefit from a lot with real-life datasets.
Another example of that is The One Billion Row Challenge, where we have one billion observations, but in the data sample that was used there are only 400-odd weather stations, so you get a lot of compression from using pooling under the covers. Then you have data structures like a bitmap, which presents itself as a mutable Boolean list but is a bitmap under the covers; it’s used for marking null primitive values, since primitives can’t be null but we want to support null primitive values. There are other data structures, like a multimap used for indexing. For example, if we index the DataFrame by name, we get the data structure you see on the right. There is also a number of fairly rich APIs that go beyond your standard map or forEach. They were, again, very much inspired by Eclipse Collections, which, rather than having a small number of very generic APIs, tries to provide reasonable, functional, rich APIs driven by real use cases.
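To give a flavor of the building blocks involved, here is a tiny Eclipse Collections sketch of a primitive-backed column and a multimap used as an index; this is illustrative only, not DataFrame-EC’s actual internal code.

```java
import org.eclipse.collections.api.list.primitive.MutableDoubleList;
import org.eclipse.collections.api.multimap.list.MutableListMultimap;
import org.eclipse.collections.impl.factory.Multimaps;
import org.eclipse.collections.impl.factory.primitive.DoubleLists;

// A primitive-backed "column": no boxing, no per-element object headers
MutableDoubleList temperatures = DoubleLists.mutable.of(12.3, -4.5, 21.0);

// A multimap as a simple index: one key can map to several row numbers
MutableListMultimap<String, Integer> byCustomer = Multimaps.mutable.list.empty();
byCustomer.put("Alice", 0);
byCustomer.put("Bob", 1);
byCustomer.put("Alice", 2);
```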
Donut Store Example
Now let’s look at more functionally interesting scenarios and see how the different frameworks handle them differently. Again, we’re sticking with our donut store example. We have three entities, the customers, the menu of donuts, and the orders, which are all represented as tables, since we’re sticking with the data-oriented paradigm. In this case, these tables will be DataFrames in their respective frameworks. For the purpose of this exercise, we’ll only look at the pure Java DataFrames, DataFrame-EC and Tablesaw, but all the code is on GitHub, and there are Kotlin and Python examples as well, as well as pure Java collections-based examples. We’ll look at four use cases. The first is listing donuts in popularity order, similar to what we saw at the beginning of the talk. The second is figuring out the priority orders for tomorrow. Somewhat arbitrarily, I decided priority orders are large orders, with a quantity greater than or equal to 12, or orders that go to a customer named Bob.
Then we compute the total spend by customer, so basically add up the dollar amounts of all their orders. The last one lets us look at the pivot functionality, and that’s just the donut count per customer per day. Listing donuts in popularity order: we’re not going to spend much time on that because it’s very similar to the aggregation and grouping we saw with The One Billion Row Challenge. The only difference here is that we sort by quantity in descending order, because we want the most popular donuts first.
Then, ascending order by donut name, just to break ties so that we get consistent ordering even when the quantity is the same. In the last step, because we only care about the order of the donuts, not the actual quantities, we call keepColumns, which drops the rest of the columns and keeps just one column, Donut. Here is our answer. Doing it with Tablesaw is pretty much the same. Slightly different method names and a slightly different way of specifying sort order, but otherwise, visually and conceptually, it’s the same. Again, when you think about solving a problem with a DataFrame, it helps to think about how you would do it in the relational world, how you would do it with SQL. You’d do a groupBy. That’s how it works.
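A rough Tablesaw-flavored sketch of this use case might look like the following, where orders is the order Table; the column names, and the generated aggregate column name, are assumptions.

```java
import static tech.tablesaw.aggregate.AggregateFunctions.sum;

import tech.tablesaw.api.Table;

// Total quantity per donut, most popular first, donut name breaking ties,
// then keep only the Donut column ("Sum [Quantity]" is Tablesaw's generated name)
Table popularity = orders
        .summarize("Quantity", sum)
        .by("Donut")
        .sortOn("-Sum [Quantity]", "Donut")   // "-" prefix sorts descending
        .retainColumns("Donut");
```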
This is where we see some differences. Priority orders for tomorrow. With DataFrame-EC, we have a method on DataFrame called selectBy, and it takes an expression that gets translated into an internal expression and then executed, applied to the DataFrame. Tomorrow is just a local date variable containing whatever tomorrow’s date is. This is the external DSL. It’s like its own little language, but I would argue that pretty much all of you should be able to look at it and figure out what’s happening here. You don’t need to read a manual, or learn things, or go up some steep learning curve to see what this means. It should be equally easy to figure out how to write expressions like these. Another benefit of this is that you can take it out of your Java code; you can put it in a configuration file, you can put it somewhere, externalize it.
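As a loose illustration of that idea, the filter could look something like this; the column names and the exact expression grammar are my approximations, and the tomorrow-date condition is left out for brevity.

```java
// selectBy takes the filter as a plain string in the expression DSL,
// so the same expression could live in a config file instead of the code
// (column names and DSL syntax here are assumptions, not the talk's actual code)
DataFrame priorityOrders = orders.selectBy("quantity >= 12 or customer == 'Bob'");
```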
The reason you would want to do that is because this is a piece of business logic, and maybe you want your business users, depending on your context, to actually own the business logic. This is something that can live outside your software, and then the accounting department, or the operations users, or whoever, can actually work with it. Now let’s see how it works with an internal DSL, where it’s just Java. This is the Tablesaw example, and again, it’s very similar. You have a where method, but the expression itself is a Java expression. The benefit is that if you type something wrong, your IDE will tell you with a red squiggly. You still have strings in there for the column names, though, so you need to test your code for many reasons.
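A sketch of the same filter with Tablesaw’s internal, plain-Java DSL might look like this; the column names are assumptions about the sample schema.

```java
import java.time.LocalDate;
import tech.tablesaw.api.Table;

LocalDate tomorrow = LocalDate.now().plusDays(1);

// The filter is ordinary Java: the compiler and IDE check it,
// but it cannot be moved out of the application
Table priorityOrders = orders.where(
        orders.dateColumn("DeliveryDate").isEqualTo(tomorrow)
                .and(orders.intColumn("Quantity").isGreaterThanOrEqualTo(12)
                        .or(orders.stringColumn("Customer").isEqualTo("Bob"))));
```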
Another problem with this is that, while it’s pretty nice declarative code, it’s still probably harder to parse for a reader than the previous example, and it’s pretty much impossible to externalize. If you say, we want the business to own the logic of what a priority order means to us, and we want to govern it separately from the application code, you can’t do it. You can’t give your donut shop employee access to a Java IDE and have them submit pull requests every time they want to change this definition.
Total spend per customer. How do we calculate that? We need to calculate the dollar amount of each order. To know that, we need to know the quantity and the price. The price we use depends on the quantity, so we want to bring together the donuts with their regular price and discount price and join them to the order dataset, and then we do our usual groupBy, aggregation, and sort by customer. Joining is very straightforward with DataFrames. I’m showing both on the same slide because, again, they are so similar. It’s simple: you specify which column you want to join by. In the case of DataFrame-EC, the first example, you can also specify which columns you want to bring over from the DataFrame you’re joining to. In this case, both will produce the same result, which is a DataFrame with two columns added to it.
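For instance, the Tablesaw side of that join could be as simple as the sketch below; using "Donut" as the join column is an assumption about the sample schema.

```java
import tech.tablesaw.api.Table;

// Bring the regular and discount prices onto each order row by joining on the donut name
Table ordersWithPrices = orders.joinOn("Donut").inner(menu);
```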
The next one is another example where you see the difference between the external DSL for expressions, which you see here, and the internal DSL we’ll see on a later slide. This is exactly the formula you saw when we looked at the requirements for this use case: if the quantity is less than 12, use the price, otherwise use the discount price, and multiply by the quantity. It’s very readable. You don’t need to specify the column type, because under the covers everything is actually strongly typed, and there is type inference being done for you.
If you try to multiply quantity by donut name, you will get a meaningful error message telling you that you shouldn’t be doing that. With Tablesaw, again, here is an example of what would normally be nice declarative code. You don’t need to specify the column type, and you can read it and figure it out: if the quantity is less than 12, use this, otherwise use that. But it’s a lot of text to express this simple piece of logic, and again, it’s baked into your application; it’s not externalizable. In Tablesaw, there is also a way to do it in an imperative fashion rather than declaratively. The declarative approach is probably better in this context.
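Roughly, an imperative row-by-row version of the computed column in Tablesaw might look like this; the column names are assumptions, and ordersWithPrices is the joined table from the previous step.

```java
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.Row;

// Walk the rows and compute the dollar amount of each order,
// using the regular price below a dozen and the discount price otherwise
DoubleColumn amount = DoubleColumn.create("Amount", ordersWithPrices.rowCount());
for (Row row : ordersWithPrices) {
    int quantity = row.getInt("Quantity");
    double unitPrice = quantity < 12 ? row.getDouble("Price") : row.getDouble("DiscountPrice");
    amount.set(row.getRowNumber(), unitPrice * quantity);
}
ordersWithPrices.addColumns(amount);
```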
The aggregation and sorting are pretty standard; we saw that before. Donut count per customer per day just shows that it’s very straightforward to implement a pivot with DataFrames, similar to the pivot functionality you have in Excel. Again, these are side-by-side examples, because they work the same way. There are two differences to call out. In DataFrame-EC, you pass the dimensions to pivot by and the values to aggregate as lists, because it supports multiple dimensions and multiple aggregation values, just like Excel does. In Tablesaw, you can only have one of each, but for this example they both work. If you don’t have values to aggregate for particular coordinates in your pivot table, Tablesaw will just report a missing value, which is its equivalent of null, whereas DataFrame-EC will actually put in the value 0. We can have a philosophical discussion about which one is more correct, but you can change this behavior if you feel it should be done differently.
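A Tablesaw-flavored sketch of this pivot is below; the exact pivot signature and the column names are assumptions rather than the slide’s code.

```java
import static tech.tablesaw.aggregate.AggregateFunctions.sum;

import tech.tablesaw.api.Table;

// One row per customer, one column per delivery date,
// with the donut quantities summed at each intersection
Table donutCounts = orders.pivot("Customer", "DeliveryDate", "Quantity", sum);
```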
Takeaways
Some takeaways. A Java DataFrame is a useful addition to your data manipulation toolkit. Again, when you have a use case that seems like a fit for the data-oriented programming paradigm, before you put your data in a list of records, or in a relational table, or a Spark cluster, consider whether a DataFrame solution would be a better approach, especially compared to things like a Spark cluster, or maybe even an actual relational database. You can actually process a lot of data relatively quickly on commodity hardware. That’s something to keep in mind. Common scenarios include ad hoc manipulation, playing with the data, the kinds of things you might do with Python/pandas in a notebook, which is what people tend to reach for when they have this need.
Then you end up with a piece of code that solves your problem, but it solves it in Python, and your backend is a Java service, and now you have to figure out how to rewrite it. If you use a Java DataFrame, you can simply take this code and basically stick it in your program. You can also use DataFrames to programmatically transform, query, and analyze data, so your standard enterprise data processing pipelines can benefit from them. Of course, the other standard use cases for DataFrames, data science notebooks and visualization, are supported by some of the Java DataFrame frameworks as well.
Resources
This talk is available on GitHub with the sources. You have links to DataFrame-EC and Eclipse Collections, and a couple of other pure Java DataFrame libraries to check out. When it comes to Java DataFrames, you have choices. Go explore them and have fun with DataFrames.
Questions and Answers
Participant: Two practical questions about DataFrames. First of all, one of the major points about databases, obviously, is you have constant change, tons of inserts, tons of reading, mutability. I’m assuming that you’re not really concerned about that with DataFrames. You have a static dataset, you read it in, you analyze it, you’re done. Is that correct?
Vladimir Zakharov: Yes. Again, just looking at real-life enterprise data pipelines, which may not necessarily be the scenarios you have to deal with, a lot of data is just files. Files, messages; they’re detached from their store of record, which may be a database. Whether we’re talking about various business units within a company, or different companies, financial institutions exchanging data, a lot of it, if not most of it, is files. When you build a data processing pipeline, maybe the final resting place for your data needs to be a database for whatever reason, but for data enrichment, data transformation, and data validation, you just have a file, which is this static piece of data and also a nice recovery point, and you just work with it, with your DataFrame.