Transcript
Kett: This talk is all about ultra-fast in-memory database processing with Java. Who of you is a Java developer? Who of you is a database developer? Who develops database applications or works on database stuff in general? Who has performance issues? This is a question for the database vendors. In this session, I will show you how you can build the fastest database applications on the planet. This depends on you, what you’re doing, which solution you choose. This is an approach that is not new. It’s used by gaming companies, online banking companies already for 20 years. Now we have a framework.
This talk is about how you can use this approach. Here we can see a lot of fancy new applications, applications of the future, so virtual reality, AI, everything is about AI these days, blockchain, and so on. For all of these modern applications, there are some factors, super important and critical. Everybody wants high performance, of course. Of course, we want low data storage costs in a cloud. Simplicity is very important for developers. Sustainability is very important for managers and organizations. Today, the reality is different. I will show you why.
My name is Markus. I’ve worked on Java for more than 20 years now. With my team, I work on several open-source projects. I’m also an organizer of a conference, try to give something back to the Java community. This is always a lot of fun. With my company, we are very active in the community. We are a member of the Eclipse Foundation. Most people know about Eclipse Foundation because of the Eclipse development environment, but it’s much more. We run more than 100 open-source projects under the roof of the Eclipse Foundation. Java Enterprise is now part of the Eclipse Foundation, it’s now called Jakarta EE. We are also a member of the Micronaut Foundation. Who of you knows Micronaut or uses Micronaut? This is a microservice framework. We are also contributing to the Helidon project. Who of you knows what Helidon is? Helidon is also a microservice framework and runtime for building microservices in Java. It’s driven by Oracle, but it’s open source.
Data Processing, Today
Let me talk about database development today. The situation is a little bit different. In my previous project, we worked on a development environment based on Eclipse. It should become a Visual Basic for Java. We developed a GUI builder. Everything went fine. We developed a Swing GUI builder, then a JavaFX GUI builder, then a Vaadin GUI builder for creating HTML user interfaces. The problem we had was, as soon as we wanted to show data on a screen, everything went bad, slow, complex, so we tried to improve this. We’ve worked on the JBoss Hibernate tools for Eclipse for almost 10 years now. We tried to simplify the Hibernate tools to accelerate speed, and we were not successful. Why? Because there are so many technical problems. This is my background.
When I talk about database programming, please keep this in mind. I worked on database stuff, traditional databases, for more than 10 years. It’s great. What we have in Java is great. We have a lot of challenges with this technology. Here’s why. Today, database programming is mostly too slow. Performance is too slow. This is why you were laughing when we talked about performance issues. Database costs in the cloud are mostly too high. All managers talk about the cloud costs are skyrocketing, and the complexity is way too high, and the systems are mostly not sustainable. Now I want to show you why.
Let’s have a look at how the database server concept actually works. We have an application. Here we have a JVM, and we have memory. We have an application or a microservice, and we have a relational database. Let’s have a look inside the relational database, because mostly it seems like a black box. We send an SQL query to a database, and we get the result. Great. When we have a look inside the database, then we can see there is a lot of memory. We have a server, of course. Then we have storage. We have a database management system, and probably there is also business logic running inside the database, stored procedures, stored functions.
Please keep these components in mind. Storage, computing, a lot of memory, and maybe business logic. What’s the problem here? When I came to Java more than 20 years ago, I was stupid enough to ask one question, what’s the difference between Java and JavaScript? The Java developers, they told me, “Markus, you cannot compare Java with JavaScript, because Java is object-oriented. It’s type safe”. That’s great. That is super important. That is what we love. Now I got it. This is important for database programming. Everything is great when we do it in Java. Everything is object-oriented, type safe.
Clean code is super important. As soon as we want to store data in a database, the horror begins, because all database systems on the market are incompatible with the programming language. It’s the same in .NET. It’s the same with all object-oriented programming languages: incompatible. It’s because you cannot store native Java objects seamlessly in a relational database. This is impossible. We have some impedance mismatches here: a granularity mismatch, and subtypes, because inheritance is not supported by the relational model. Then we have different data types. In Java, we have some primitive data types. In PostgreSQL, there are around 40 or even more data types supported by the database. This is always a challenge.
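To make the subtype mismatch concrete, here is a small, purely illustrative Java sketch (the Media, PrintedBook, and Ebook classes are invented for this example, not taken from the talk): in Java, the subtype relationship is a single extends keyword, while the relational model has no inheritance, so an ORM must emulate it with a single-table, joined, or table-per-class mapping strategy.

```java
// Purely illustrative classes showing the subtype mismatch: "Ebook is a
// Media" is one keyword in Java, but the relational model has no inheritance,
// so an ORM must map it via single-table, joined, or table-per-class schemas.
public class SubtypeDemo {
    public static void main(String[] args) {
        // Polymorphism is free in Java; in SQL, each subtype needs a mapping strategy.
        Media m = new Ebook("Clean Code", "https://example.com/ebook");
        System.out.println(m instanceof Ebook); // prints true
    }
}

abstract class Media {
    final String title;
    Media(String title) { this.title = title; }
}

class PrintedBook extends Media {
    final int pages;
    PrintedBook(String title, int pages) { super(title); this.pages = pages; }
}

class Ebook extends Media {
    final String downloadUrl;
    Ebook(String title, String url) { super(title); this.downloadUrl = url; }
}
```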
The question is, what about the NoSQL databases? Who of you uses NoSQL databases today? Are they better? The fact is, they are very different. What’s the difference? The NoSQL databases introduce new data types, new data structures. This is the biggest difference. The functional principle is pretty much the same. They are also server databases, mostly. They introduce key-value stores. They introduce documents like JSON or XML, or a column store, or a graph database like Neo4j. We had the object-oriented databases in the 1990s because, initially, we wanted to store objects in a database, so it was obvious to invent object-oriented databases.
Obviously, it didn’t work well. We have time series databases. Now with AI, we have the vector databases. What database should we choose? They are all incompatible with the native object model of Java. That’s a fact. They are also incompatible, and that’s a challenge. In Java, we can do everything. We can handle all types. We can store and process all data structures and data types. We can deal with everything. That’s great. This is different with databases. They are limited in terms of the use case. This leads to big challenges. You can read more about this on the internet. Even on Wikipedia, we can find an article about object-relational mapping or impedance mismatches.
This is how it works. In our application, we need something additional to store data in a database. We use object-relational mapping. This is a well-known concept, and it has worked for decades. Who of you uses Hibernate, EclipseLink? Object-relational mapping is very common to store data, or Java objects, in a relational database. There are drawbacks. This is super expensive, because object-relational mapping is very time-consuming and it leads to high latencies. Suddenly, your queries become really slow. This is what we found out. Is this true? Yes, we agree. Not always? Mostly? Sometimes? We can fix this problem, of course. Let’s add a cache. This is what we did in our development environment. We introduced Hibernate. Then it was too slow. Then we added a cache. Then we had additional complexity.
Now we have to deal with cache configurations and so on. Now the results are stored in memory. This will be way faster. We were not satisfied with the performance, actually. Why? Have you ever measured how long it takes to read data from a cache? I grew up with assembly programming. In the keynote, we heard about assembly will become, hopefully, more popular in the future when we deal with quantum computing. When I had a Commodore 64, then I was able to process data in memory in microseconds. When I read data from a local cache with Hibernate, then it takes milliseconds. I was like, what’s the problem here? Why does it take milliseconds when I fetch data from a local cache? The problem is object-relational mapping. Obviously, this is super expensive.
Then we talked about single-node applications. Who of you develops distributed applications? That’s a little bit more complex. Now, here we have an application that runs on multiple machines. What’s happening when you change data on one machine? Then the machine will be synchronized with the database. Everything is fine. The problem is all other nodes are not in sync with the database. This can be a problem. We are developers. We can solve this problem. There is another cache strategy. Let’s put the cache in between the database and the application layer. Because we are in the cloud, so we want to avoid a single point of failure, so we use a distributed cache. We use a cache that is executed on multiple machines. Who uses a distributed cache like Redis? Very common.
Then we have such an architecture. You can see the machines growing more and more. Does it make sense to run a cache without memory? No, it’s nonsense. Of course, we need a lot of memory. We use memory, and we need memory. What about the database? Do we run a database application on a single database node? Probably, yes. If the application is mission critical, maybe you will run a database cluster to share the load, for data redundancy, and so on. Having a database running on multiple nodes means there are more machines running. Does it make sense to run a database without memory, or with low memory? You can do that, but it will be slow. You need a lot of memory in a database server as well.
Now, we talk about an application that runs on multiple machines. Now we deal with microservices. We split the application in multiple services, and it looks like that. We have a lot of machines running to maintain. This is very common. Then we have a great database, and databases are so fast today. Who of you uses Elasticsearch? Why? The database is fast enough. Obviously, sometimes it’s not fast enough, so you add another solution, and now you can explain to your managers why cloud computing is so expensive. This is really true. This is not the case in all applications. Sometimes you have only one solution or two or three solutions.
On top of that, we talked about data structures, data types. Let’s say you have your Oracle database, and then you need some sensor data, so you will probably have a time series database. Then you deal with vectors, so we have a vector database for AI. On top of that, you have multiple database systems running. This is the reason why database development is super effortful, expensive in the cloud, and slow. It’s not sustainable, actually. It produces a lot of CO2 emissions and consumes a lot of energy. Let’s wait for quantum computing. See you next year.
Alternative Java-Native Approach
What’s the alternative? Is there an alternative, actually? Yes, there is already. You don’t have to wait for quantum computing, if you change the software stack. This is not magic, it’s actually obvious. Let’s have a look at how it works. Here is a solution for cheap data storage. When we use a PostgreSQL database, for instance, it’s a server database, and this is an example based on AWS. You use PostgreSQL as a service, just with 2 CPUs, 8 gigabytes of memory, and 1 terabyte of storage. Run it on one node. It will cost you around $4,000 per year. If you need multiple instances, of course, your price will double, triple, and so on.
If you need more nodes, six nodes will cost you around $30,000 per year. The cloud providers, they provide us Blob storage, or binary data storage like AWS S3. The cool thing here is it costs almost nothing. 1 terabyte S3 costs only $300 per year. That’s great. You can have the same on Azure or Google Cloud. There is a solution where we can save a lot of cost in the cloud, and look at the CO2 emission. It’s almost nothing. The energy consumption is 99% lower. You don’t have to maintain it, it’s managed by the cloud provider.
Here are some facts about Java. Because at all conferences, we talk about how we love Java. If you attend a Java conference, you will hear this phrase: we love Java. Now let’s have a look at why we love Java. It’s so fast. Everything that’s executed in memory in Java is executed in microseconds. This is similar to my Commodore 64. Sometimes it’s even faster, even nanoseconds, because of our great JIT compiler. We have the best data model on the planet: objects, object graphs. We can deal with all data types. We can deal with all data structures: vectors, JSON, XML, relations, graphs, like a graph database. Everything is possible. This is a multi-model data structure from the beginning. No limitations in terms of the use case. What about searching and filtering? We have the Streams API. With Java Streams, you can search and filter in memory in microseconds. You can compare this with a JPA or a SQL query. It is mostly 1,000x faster than a comparable JPA query.
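As a minimal sketch of that claim (the Book record and the sample data here are invented for illustration, not taken from the talk's demo), this is what an in-memory search with the Streams API looks like, compared in spirit to a SQL SELECT:

```java
import java.util.List;

// Illustrative only: a tiny in-memory "bookstore" searched with the Streams
// API instead of sending a SQL/JPQL query to a database server.
public class StreamQueryDemo {
    public static void main(String[] args) {
        List<Book> books = List.of(
            new Book("Effective Java", "Joshua Bloch", 49.99),
            new Book("Java Concurrency in Practice", "Brian Goetz", 39.99),
            new Book("Clean Code", "Robert C. Martin", 29.99));

        // In spirit: SELECT title FROM book WHERE price < 40 ORDER BY title
        List<String> cheap = books.stream()
            .filter(b -> b.price() < 40)
            .map(Book::title)
            .sorted()
            .toList();

        System.out.println(cheap); // prints [Clean Code, Java Concurrency in Practice]
    }
}

record Book(String title, String author, double price) {}
```

Because the data already lives in the JVM heap as plain objects, the filter runs at memory speed; there is no network round trip and no result-set mapping.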
Now I will show you a brief demo. Here we have two applications running in parallel. One is built with the JPA stack, so with Hibernate. We have a PostgreSQL database, 250 gigabytes. This is a bookstore application. We use Ehcache. It’s a hot Ehcache. This is in memory, so we fetch data directly from memory. On the right, you can see the query code. Here we use Spring Data as a framework. The second application is built with EclipseStore. We use a Blob store, like S3, and here it is S3. We use Java Streams to search and filter. All queries are executed sometimes 10 times faster, sometimes 100 times faster, sometimes more than 1,000 times faster than the comparable JPA query. Keep in mind, we fetch data directly from a cache. With Java Streams, we are up to 1,000x faster than the Hibernate cache. This is the performance of Java. You can even improve it by changing the JVM. For instance, you can accelerate in-memory processing with the OpenJ9 JVM. It’s also an Eclipse project. It can be 20% more efficient and faster than HotSpot. You can play around with the different JVMs. It’s incredibly fast.
EclipseStore
What’s the problem? The only thing missing in Java was persistence. How can we now store data on disk? This is what we have developed at the Eclipse Foundation. This project is not a prototype or just an idea. We have been developing this for more than 10 years. It’s production-ready. It’s in use. It’s in production use by companies like Allianz, Fraport, here in Germany. More companies are using this framework. It’s under Eclipse public license, which means you can use it for commercial purposes free of charge. There are four benefits. 1,000x faster data processing in memory. You save more than 90% cloud database costs, and we do not talk about license fees. It’s Java-Native, which means simple to use. It’s fully object-oriented. It’s type safe. It feels like a part of Java. This is very important. Because we don’t need a database server anymore, just storage, we save 99% energy and CO2 emissions, and you develop the fastest application on the planet, and at the same time, you save the planet. How great is this? How does it work? What actually is EclipseStore?
It is a micro-persistence engine, so it is a persistence framework, similar to Hibernate, to store native objects. This is the difference from Hibernate: it stores your native Java objects seamlessly to disk, and restores them when needed. That’s the functional principle of the framework: without object-relational mapping, without any mappings, without any data conversion. There’s no JSON conversion behind the scenes or anything like that. This is the biggest difference, very important, compared to all databases on the market: no mappings, no data conversion, the original Java model is used. Use the original Java object model, and you can persist your POJOs seamlessly into any data storage.
It’s just a Maven dependency. It’s very easy to use. The whole framework has only one dependency, to the Eclipse Serializer that’s used behind the scenes. The only thing you need is an EclipseStore instance. This is how it works at runtime. You need an instance of your data storage in memory, and in memory it works like a tree. Who of you was a Swing developer? What about JavaFX? It’s the same here with EclipseStore: you need a node, an instance, a root object, and then you add objects, and all objects that are reachable from this root object can be persisted and stored on disk. This is the functional principle. I create a root object and add some objects. You can use all Java types. Only Java types that can be recreated can be used and stored. You cannot store a thread, obviously, but all other Java objects can be used.
Then you call a store method, and a binary representation of your object will be created and stored on disk. The information is stored in binary form, and we use the Eclipse Serializer for creating the binary and storing it on disk. This operation is transaction safe. We get a commit from the engine, and then it’s guaranteed that the object is really stored on disk. Let’s add some more objects. We call a store method, and another binary file is created. This is how it works: each store operation creates a new binary file in the storage. It’s different from the relational model; it’s an append-log strategy. The method call is very simple, just one method to call, and then you can store your objects. This is a blocking, transaction-safe, all-or-nothing atomic operation. Vice versa, what’s happening when you start the application? When you start an application, the framework will load your object graph into memory.
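The lifecycle described above (create a root, start the engine, call store) can be sketched roughly like this. This is a hypothetical example based on the public EclipseStore API; DataRoot, the book title, and the storage directory are invented for illustration, and the exact package and method names should be verified against the EclipseStore documentation.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.eclipse.store.storage.embedded.types.EmbeddedStorage;
import org.eclipse.store.storage.embedded.types.EmbeddedStorageManager;

public class StoreDemo {
    public static void main(String[] args) {
        DataRoot root = new DataRoot();

        // Starts the engine; on a later start it restores the graph from disk.
        EmbeddedStorageManager storage =
            EmbeddedStorage.start(root, Paths.get("bookstore-data"));

        root.books.add("Effective Java");

        // Blocking, transaction-safe call: it returns once the binary
        // representation of the changed list is appended to the storage.
        storage.store(root.books);

        storage.shutdown();
    }
}

// The root of the object graph: everything reachable from here can be persisted.
class DataRoot {
    final List<String> books = new ArrayList<>();
}
```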
Handling Concurrency in Java
Kett: How does it work with multiple threads? You can use all Java concepts to handle concurrency, but you have to take care of concurrency. We have to handle this in Java, or rather, we can handle this in Java. Then you have full control over which objects which threads store, transaction safe, to disk. You will get a commit from the library.
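One common way to do this in Core Java can be sketched as follows. This pattern is an assumption, not code from the talk, and persistToDisk is a placeholder standing in for the engine's store method: guard the shared in-memory graph with a ReadWriteLock and call store inside the write lock, so no thread can persist a half-updated graph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ConcurrencyDemo {
    private final List<String> books = new ArrayList<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    void addBook(String title) {
        lock.writeLock().lock();
        try {
            books.add(title);
            persistToDisk(books); // stand-in for e.g. storage.store(books)
        } finally {
            lock.writeLock().unlock();
        }
    }

    int count() {
        lock.readLock().lock();
        try {
            return books.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Placeholder: in a real application this would be the engine's
    // blocking, transaction-safe store call.
    private void persistToDisk(Object o) { }

    public static void main(String[] args) {
        ConcurrencyDemo demo = new ConcurrencyDemo();
        demo.addBook("Effective Java");
        System.out.println(demo.count()); // prints 1
    }
}
```

Readers share the read lock, so queries run concurrently; only mutations plus their store call are serialized.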
EclipseStore
Kett: When we start an application, the engine will load the whole object graph into memory. At this point, it is very important to mention: only the object graph information is loaded, which means only object IDs are loaded into memory. We will not load the whole database into memory. Only the object IDs are loaded, so you’ve got an indexed object graph in memory. Then you can define which object references should be preloaded into memory and which should be loaded on demand by using lazy loading. You can have a terabyte, tons of objects, in your storage and only 2 gigabytes of memory, and it will work. It’s super easy to define your classes as lazy or eager; this is just a wrapper class. Then the engine will either preload object references into memory or load them when you access the object with a get method. This is how it basically works.
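Conceptually, the lazy wrapper behaves like the following memoizing reference. This LazyRef class is a simplified stand-in written for illustration; EclipseStore's actual wrapper additionally loads the referenced objects from the storage on first access, rather than computing them in memory.

```java
import java.util.function.Supplier;

public class LazyDemo {
    public static void main(String[] args) {
        LazyRef<String> heavyData = new LazyRef<>(() -> {
            System.out.println("loading from storage...");
            return "1,000,000 orders";
        });
        // Nothing is loaded yet; memory stays small until get() is called.
        System.out.println(heavyData.get());
        System.out.println(heavyData.get()); // second call: no reload
    }
}

// Simplified sketch of a lazy reference: the value is materialized on the
// first get() call and kept in memory afterwards.
class LazyRef<T> {
    private final Supplier<T> loader;
    private T value; // null until first access

    LazyRef(Supplier<T> loader) { this.loader = loader; }

    T get() {
        if (value == null) {
            value = loader.get(); // loaded on demand, then cached
        }
        return value;
    }
}
```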
Queries are simple, because we use the Java Streams API for searching and filtering. This is very fast. You can check this out: each query will take only microseconds, mostly, because of the speed of the Java Streams API in memory. The storage will grow more and more, and this is the reason why there is also a garbage collector for your file storage. If you have older objects in the storage and you change the data model, then you have legacy objects or corrupt objects in the storage, and a garbage collector process will clean up the file storage constantly and will keep your storage small. This is the functional principle.
The Eclipse Serializer is the heart of this framework. On top of that, we provide an implementation for the JVM. EclipseStore is built for the JVM, but there is also an implementation for Android. Who of you is a mobile developer or develops mobile applications as well? What happens if your classes change? This can be challenging, but it’s not with EclipseStore, because we have a concept that’s called legacy type mapping, and the framework cares for all of your changes automatically, or, for complex cases, you can also define a so-called legacy type mapping yourself.
Then the storage, or the legacy objects, will be updated at runtime, so you never have to stop your application and refactor the whole storage; that is not how it works. We have a file system garbage collector, as mentioned, and a file system abstraction, which means you can store your data in a Blob store, but you can also store your data locally, just on disk. You can store your data almost everywhere, in any binary data storage. This can be confusing, because a relational database can deal with binaries too. You can even store your binaries in a relational database, but keep in mind, there is no object-relational mapping anymore. We just store binary data. There are database connectors that you can use: you can use an Oracle database, you can use PostgreSQL.
Actually, it makes no sense, but in some business cases, it can make sense. We had a customer. They used Oracle. They told us, that’s a great approach, but we have to use Oracle. Now they store EclipseStore binaries in an Oracle database. It’s possible. The Oracle guys do pretty much the same with their graph layer: they provide a graph database, but it’s not a real graph database, it is actually a graph API layer on top of the relational database. They store graph information as binaries in a relational database.
Then we have a storage browser, where you can browse through your storage data, and a REST interface, so you can get access to your storage and search and query your storage directly via REST. There are backup functions and converters to CSV, for instance, so that you can migrate easily to EclipseStore, or from EclipseStore to any other database, if you like. It runs on the JVM from Java version 11. It runs with any JVM language and on Android. It runs in containers. It runs in Docker containers on Kubernetes, even with GraalVM Native Images.
We talked about single-node applications, and this functional principle also works in distributed systems. For this scenario, MicroStream provides you a PaaS platform for deploying and running distributed EclipseStore applications. We also provide an additional version for even more performance, with indexing, for instance, so you get out the most speed that’s possible. How does it work? Now we can execute an EclipseStore application on multiple machines, and the MicroStream cluster provides you data replication and data redundancy. The service is fully managed or available on-prem. There is an eventual-consistency approach. This is how it looks in a distributed environment.
Back to our previous architecture, we have a Hibernate application running on multiple machines, a distributed cache, we have a database system. Now we replace the Hibernate applications with the EclipseStore applications. As we keep and query all data in memory, it is already working like a distributed cache. We store data in a Blob store, AWS, for instance, so we can skip the database cluster completely. As mentioned, we keep data in memory, we replicate data in memory through multiple JVMs. We don’t need a distributed cache anymore, so we can also skip the local cache. Then you can still use Elasticsearch if you like, but you can also use Lucene, and you don’t need a search cluster anymore. It depends on you. Then, the end result is a really small cluster architecture, low cost, super-fast, easy to implement and maintain because everything is Core Java. It feels like a part of the JVM, it feels like a part of the JDK.
Importing an EclipseStore Binary File into Lucene
Participant 2: You mentioned Lucene, so can you import an EclipseStore binary file into Lucene and then it will just work?
Kett: No, this is not how it works. You can use and combine Lucene with EclipseStore as you can use all Java libraries and combine it, that are available in the Java ecosystem. Lucene cannot parse the binaries. You include Lucene and you will search and filter in memory. The binary files are only used for storing the object persistently on disk. You never touch the binary file. It’s the same with your database server. Your database system will store the data in an internal format on disk. You never touch it, actually. It’s the same here.
Rules and Challenges (EclipseStore)
There are also some rules and challenges with EclipseStore, because every technology has pros and cons. There’s a comparison. Here’s, again, the traditional database server paradigm. We have an application and we have a database server. Queries are executed on the database server. The persistent data are stored in the database server, obviously. With EclipseStore, this changes. Now, your database is in memory. You don’t have to load the whole database into memory, but it works as if you had. It feels like the whole database is in memory, but it’s not; that’s managed by lazy loading by the engine.
Keep in mind, your database is in memory. We search and filter in memory in the application node. Only the storage data are stored in an S3 bucket or something like that. That’s the main difference. You have to think a little bit differently. There is no more classic SELECT that you send to a server. You don’t use SQL, you use Java Streams. There is no database server. There is no graphical user interface where you have to create a database model. You just have to create classes. That’s it. There is no database model anymore.
Again, in-memory means everything is executed in memory, so you actually need a lot of memory. I showed you how you can save a lot of money, because we don’t need a database cluster and we don’t need a distributed cache cluster, so we have a little bit more money left for buying a little bit more memory. If you don’t have enough memory and you have small-memory machines, it will still work. This is very important: the more memory you have, the faster your system will be. This is not standard, but I like to mention it: if you need a way faster approach for really blocking operations, transaction-safe operations, with the speed of an asynchronous approach, with really high write performance, then you can use, for instance, persistent memory. This is super interesting. You just have to add persistent memory to your server.
Then, all write operations are not directly stored to disk; they are stored in a persistent memory area. It’s transaction safe. It takes microseconds to store, not milliseconds, because there are no disk I/O operations involved. You can store in a high-performance way. It’s like copying from one memory area to another memory area, but this area is persistent, so it provides you persistence. It’s called persistent memory. Then, behind the scenes, you can synchronize the persistent memory with your disk asynchronously. This is extremely fast.
Challenges with EclipseStore. The biggest challenges are, you have to think like a Java developer. Java developers mostly don’t think like Java developers in terms of database programming. In terms of database programming, our brain works like a relational database. If I tell you, “Please create a database application, I need a shop system. I have customers. I have articles”.
Then, your brain will create a relational model in microseconds, sometimes milliseconds, because we have been used to relational modeling for 10 years, 20 years, or even 30 years. You have to stop with relational modeling. Create an object model that fits Java. Forget what you have ever heard about the relational model. Forget what you have ever heard about how a relational database system works. Focus on how Java works, on how you would implement it in Java. Trust that the framework will be able to store it. That’s it. That’s the biggest challenge: to create a proper object model. It’s built for Java developers. There are no surprises for DevOps and for database admins.
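As an illustration of that mindset (the shop classes here are hypothetical, invented for this sketch), modeling the shop "the Java way" means direct object references instead of primary keys, foreign keys, and join tables; navigating the object graph then replaces a JOIN:

```java
import java.util.ArrayList;
import java.util.List;

public class ShopModelDemo {
    public static void main(String[] args) {
        Article article = new Article("Keyboard", 59.90);
        Customer customer = new Customer("Alice");
        customer.purchases.add(article); // no join table, just a reference

        // Navigating the graph replaces a JOIN:
        double total = customer.purchases.stream()
            .mapToDouble(a -> a.price)
            .sum();
        System.out.println(total); // prints 59.9
    }
}

// A hypothetical shop model designed the Java way: a customer's purchases
// reference Article objects directly; no IDs are needed to navigate the graph.
class Article {
    final String name;
    final double price;
    Article(String name, double price) { this.name = name; this.price = price; }
}

class Customer {
    final String name;
    final List<Article> purchases = new ArrayList<>();
    Customer(String name) { this.name = name; }
}
```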
This is the reason why, if you have colleagues who are database admins, they will probably not like it. This is not a drop-in replacement. Please stop dreaming that there is a magic button where you can replace your Hibernate stack and your relational database with EclipseStore and it will work seamlessly. This is not going to happen. There is a migration effort and a migration path, but it’s doable. It’s not complicated, but there is an effort. Keep this in mind. There is no SQL support, but the application can be queried by external services and applications by using GraphQL or REST. This is possible, but no native SQL support, obviously.
Conclusion
Compared to traditional database applications, this approach provides you simplicity. It’s because you can deal with all Java types. There is no more mapping, no more data conversion behind the scenes. It’s Core Java. There are no dependencies. You can use POJOs, and everything can be stored. It can be replicated. You can build distributed applications very easily. You will have high performance. Because of the speed of Java, all operations are executed in-memory with Java Streams in microseconds or even faster. It’s suited for low-latency, real-time data processing. You have really awesome throughput. It will save a lot of cloud costs because there is no database server anymore. There is just storage. Storage alone is more than 90% cheaper than any database server in the cloud. That’s great. Because there is no more server required, and these numbers are from Amazon, you will save more than 99% of CPU power, energy, and CO2 emissions. Here is a comparison of what could be saved if we replaced all database servers with object storage. This would be amazing. This is not going to happen; this is only in theory. Between 20% and 30%, or probably even more, of the servers on the planet are database servers. These numbers are growing because of AI, as more vector databases are required. We could save a lot of energy and CO2 emissions.
Resources
If you are interested in learning this approach, I have a free course for you. You can enroll for EclipseStore course for free. We provide advanced training and even fundamental training for free. If you’re interested, check it out, www.javapro.io/training. Build the fastest applications on the planet by using Java.
Questions and Answers
Losio: You say I never have to access the storage layer, so S3, directly, and I don’t care how you store the data in S3. If you have 1 million records in your table, do you store 1 million binaries? Is it one file? How is it structured there?
Kett: Behind the scenes, the engine will reconfigure and reorganize the storage constantly. You don’t have to care about how the structure looks in your storage. That’s done automatically by the engine. We have a garbage collector process which deletes the legacy objects. You can configure that. This is how it works.
Participant 3: As far as I understood, do you position the solution as a drop-in replacement for DBMSs, or for enterprise applications, or as just something different, or just for embedded applications?
Kett: It is a persistence framework for storing Java objects, this is what we had in mind, to replace Hibernate. You use Hibernate to store your objects in a database. You use it for almost all use cases. You can build complex enterprise applications, or you just store your tests, or anything that can be stored. You can use it for almost any purpose. It is great for low-latency applications, where you need real-time speed, where you really need high speed. That’s great to use it for that purpose.
Participant 3: I’m not a DBA, but I would like to defend the DBMSs. There are five points on the slide regarding the implementation that you will have to do on your application side. For me, it’s not fair not to mention that this is something on top of your business logic.
Basically, there’s all this stuff. If you know about Postgres and MVCC, Multi-Version Concurrency Control, that’s a very complicated thing that allows access to the data storage from multiple applications. Also, regarding the tools: good luck with doing the updates of these Java applications as soon as your enterprise application requirements change. You have DML, DDL, and all these high-level abstractions. I’m talking about SQL-like things that allow you to do very complicated things with just a few lines of code, instead of implementing very challenging code on the Java side. Do I understand right that this very complicated layer, like the concurrency thing, is not comparable to this enterprise-y thing that we just do? It’s very complicated. Does it mean that the enterprise application developers have to deal with that as well?
Kett: Obviously, the database cares for concurrency and everything. You don’t have to care for anything. In practice, we see that we have to care for concurrency. We do it anyway, in Java, very often. With microservices, it changes completely, transaction safety and so on. With Java, we have great solutions for that. In our perspective, this is not more effort. It is pretty much the same effort, because mostly you have to do it anyway. You need experience with concurrency handling in Java. Actually, it’s Core Java stuff. There are no new things to learn. This is not like a SQL database, you have to learn a new data model, new query language. It’s Core Java stuff.