Transcript
Adi Polak: My name is Adi Polak. I work for a company named Confluent. We're building a data streaming platform. We're contributing to Apache Kafka and Apache Flink and Apache Iceberg. We have a lot of Apache projects that we contribute to. I wrote a book for O'Reilly on scaling machine learning with Spark specifically. My second book, the second edition of High Performance Spark, is coming out. I started my career in the machine learning space 15 years ago, moved into data infrastructure and batch processing, and a year and a half ago I moved into the data streaming space, which I think is what's going to help us pave the future of data.
Sarah Usher: My name is Sarah. I am a software engineer turned data engineer. I’ve worked across various domains, from banking, law, insurance, ad tech, developer security. I like to work in all different axes of scale. It’s not just large data processing, but lots of different datasets, cultural challenges, and growing your business and growing your data with your business.
Matthias Niehoff: Matthias, as well from software engineer to data engineer, working for codecentric, a German consulting company. When I turned to data 10 years ago, my then boss said, maybe have a look at Spark. That wasn't the worst advice, looking at it now, 10 years later. Yes, helping clients with data architectures, building data platforms, and, because of my software engineering background, especially interested in how to get data from operational systems towards an analytical system. What can data actually learn from all the software engineering stuff? Also, what can software engineering learn from data? Always trying to mix those two worlds.
Software Engineer vs. Data Engineer – The Transition
Participant 1: This is a slightly more human question, a data question. Since two of you mentioned you were software engineers turned data engineers. Two things I always feel like I struggle with are, first, explaining to leadership the difference between a software engineer, a data engineer, and a data analyst. The second part is how to really talk to people who want to make a similar transition and try to explain to them what it actually means. For context, I would also count myself as a self-identified data engineer that doesn't have time for it because I'm a manager. It's something that I struggled with before. I've seen people go into it and be really unhappy because they didn't really know what it means.
Matthias Niehoff: Your question is basically about how to make it clear what it means to go from software engineer to data engineer, what the difference is. Because we are a software consulting company with a lot of software engineers, and we're looking for data people, actually. Data people are hard to find currently. We're looking for software engineers turning to data. That is exactly the point. I'm a very hands-on person. For me, it's always like I have to do the job once or twice. That's the way I learn. So many people can tell me, yes, but data is different. Data feels different, because you have this data dimension now, and you have to test it somewhere. The technologies are different. You can learn a lot from courses, from getting hands-on with play projects and pet projects or something. You have to do it and learn it. I think this is the most important part, and feel the challenges.
So often with failures, a lot of people say, "Don't do this. This will fail. This will fail". They will do it regardless. They will have to feel the pain first most of the time. Learning the data engineering part is really like that: you have to do it. More interesting is convincing someone that it's worth going into data engineering, and that data engineering is not a step down in their career. Because that's what I see, multiple people saying, "I don't want to do data because I'm an engineer. I'm doing testing. I'm doing CI/CD. I'm doing all these good practices". I think it's really up to us to show that data isn't just click, click, click in a UI where you build graphs and things. It's really software engineering. All the practices you have gathered before are actually valuable in the data world, and they transfer when you make the move. This is really important to say and to make clear.
Sarah Usher: I don’t know if this is a controversial opinion. I believe in data engineering, I don’t believe in data engineers, which given that I am one is slightly challenging. The reason I say that is because having come from a software background and I’ve taken that experience and applied it to data systems, yes, not every problem is exactly the same, but you do apply very similar thinking. A small thing, for example, is I might do inside-out TDD when I’m building logic in a service, but I do outside-in TDD when I’m looking at data pipelines because that perspective is easier. It’s the same kind of principle. I still need tests. I still need a very logical way of building systems. I often actually refer to myself as a data and backend person, mostly because I’m bad at frontend. To me, it’s the same.
The reason I say this is because I just feel like when I started, there was no such thing as data engineering. There were data analysts, and I feel like that's a pretty distinct role. Having analytical skills and doing analytics is quite a distinct skill in my opinion. Not to say that we don't share some tools, but the way we use them is slightly different. When it comes to building systems, and especially when we're looking at distributed systems, there is more and more overlap. I don't really see a distinction between a product engineer and a data engineer as much.
Also, when I look at people who call themselves architects, solution engineers, and you’re looking at the IC, you’re looking at staff and principal engineers, if they’ve only got data engineering, or they’ve only got product engineering, they’re not actually very good because they don’t have that broad perspective. That’s why I say I believe in data engineering. I believe that there are paradigms that apply to data systems, but I don’t necessarily believe that it’s a singular role, and that’s the only thing you should do.
Adi Polak: The way I see it, data engineering also didn't exist when I started my career. Essentially, some folks come from a background as a DBA, with a very deep understanding of SQL, SQL databases, how to optimize, how to create procedures, how to optimize queries, how to go about that, versus people that come from the software engineering, distributed systems world. Then there's interaction in the middle, because when Spark was brought to the world way after MapReduce and so on, and the idea of NoSQL emerged, we actually asked software engineers to start learning about the data world and combine their skills in distributed systems with the data space. Now as we're evolving, and there are more solutions out there that enable us to separate compute from storage in a very efficient manner, we're slowly parting ways with Hadoop.
People oftentimes don’t need to know about the trenches of Hadoop if they’re not running it themselves. There is returning to the basics of what is data modeling, how do we go about it? How do we optimize queries? How do we create the procedures? Now we’re still operating in a space where we have different compute engines under the hood, and we either need to know which one to choose or manage, or it’s abstracted to us. It’s a mixture of skills and expectations from the industry that always evolve with time. The basics are true.
If you don’t know the types of your data, if you don’t know how a query engine works and how to optimize, or if you don’t know how to bring the right data in or give it a structure, you won’t be able to create any data pipeline at the end of the day. It’s just becoming more and more challenging because we’re building on top of existing skills and people with previous knowledge of how the core of this distributed system works are actually going to be the people that are able to solve the problems. Because the high skills, it’s not in creating the pipeline, the high skills is actually when I have a production problem and I need to solve it.
Who is the engineer that I’m going to wake up in the middle of the night first that’s going to solve it within a couple of minutes, because they’ve seen it before? They know how the internals work. They know how the system operates under the hood versus the new one that just started and they haven’t yet figured out the end-to-end architecture. This is the challenge that we’re facing today with data engineering. It requires years of experience. It’s usually not something that you just finish university and you’re good to go.
Data Architecture Modernization
Participant 2: I just want to get an overview: in our organization, are we getting it right in the current format that we have? Maybe you can just comment on it. I think in our organization we were traditionally in the BI space, traditional BI where you've got your data warehousing specialists and you've got your modelers, BI specialists that lean more towards data analysis in terms of running queries and connecting to Power BI and doing those types of reporting. What we've recently had is a shift in terms of creating paved paths to say that a person that was traditionally a data warehouse specialist now becomes a data engineer.
Then there's a person that comes from a software engineering background and is coming into the space where they're interacting with the data systems, those data warehouses or data lakes. We now term them platform data engineers, because they do a bit of platform work and then they're actually getting into the data spaces where they're connecting or integrating different systems. Then we have your traditional BI specialists that were more about modeling, creating views and things like that; they are now your data modelers and your data analysts. I just wanted to gauge whether we are heading in the right direction, or is it time for us to basically have everyone as data engineers? We're just trying to find our space right there as well.
Adi Polak: I always say, whether we're heading in the right direction is really a question of, do we have any tools to stop it? If the answer is no, we'll just adopt. It's rolling, and if this is where the industry is going, we can either adopt it and build on top and learn and gain the right experience, or we can try and say, no, this is not true, and be left behind not doing what everyone else is doing. It's a little bit of a philosophical question whether it's true or not. I'll give an example. I went through a machine learning winter many years back when I started my career, because I started in machine learning and then I moved to the data infrastructure world. We had a huge winter. No one was using machine learning. Everyone said it's a hype. We went back to the data space. Some people stayed in the machine learning space and continued to evolve that, and some people didn't.
Back in the day, when there were no jobs in that space, a lot of colleagues and also I decided to pivot, because at the end of the day you want to stay relevant in your career. You want to stay relevant for the current industry. Now we have a huge boom that started about two or three years ago with AI and generative AI. When I saw it initially, my gut reaction as someone who went through a similar experience before was, I know what's going to happen. We're going to have a hype, and then we're going to come back down into data infrastructure, because at the end of the day the quality of your models highly depends on the quality of your data, and we'll continue that loop, and we're either going to improve it or learn for the next steps or not.
Again, this is driven by experience. What I did for me and my team (I'm responsible for two teams at Confluent) is ask, what is the data aspect of these things? What can we learn? How can we enrich that, so when it comes full circle and we go back to data quality, data modeling and so on, we'll be smarter? Again, it's something that only comes with experience. You have to go through something like that, or you have to work with someone who went through that, and you have to take big risks, because at the end of the day we're betting on something that's going to happen in the future. I think your question is really important: what should I invest in? It's a question of, what did you go through? Who did you talk to? What do you believe is going to be the future? How do you set yourself up for success in the next two or three or four years?
Matthias Niehoff: Your question is saying, you have this BI team and you have your software engineers, and now you want to do modern data architectures, so you're moving the software engineers to build the platform, the former data warehouse specialists should build the data model on top of it, and you're separating the roles. I think this move of modernizing the data architectures, getting software engineering knowhow into it and making it platform oriented, is pretty common. I'm not sure if I would actually separate those roles and say, you are a data engineer, you are maybe a data analyst, ok, because that one is special. I'm not sure if I would really make a difference between data engineer and data platform engineer. Yes, the BI and data warehouse folks will have a different skill set, but I would say, this is one team, those are data engineers responsible for building a data platform. I wouldn't say, you only do modeling and you only do platform. I think it would be good if both know both things.
As a data engineer you often think like, yes, Snowflake and dimensional modeling, then I have to understand the business, this is really annoying, I just want to do tech. I think if you say, you together as a team with different skill sets, a really diverse team, are responsible for this, and you are all data engineers or platform engineers, whatever you call it, that would be more helpful than really micromanaging those roles.
Sarah Usher: The company that’s going to fire all their BI engineers to hire ML engineers versus the company that’s going to invest in their BI engineers into data engineers and ML engineers is going to outweigh the first one, they’re going to win.
What’s the Right Technology to Invest In?
Participant 3: I wanted you to draw on the experience that you have, recent experience, and there's also a philosophical element to the question. There is a lot of innovation taking place in the data space, and there are a lot of products coming out. There are already a lot of products. I've been attending sessions here talking about embedding agentic solutions, indexing, and whatnot. There are quite a few vendors out there that have different solutions. As a firm, how do we decide what technology to invest in, what to select? Also, the challenge is, because these are new technologies, there isn't that much experience in the market. When you deploy or develop industrial-scale applications using these technologies and then they go wrong, how do you solve that problem as a firm? I work in financial services; you invest a lot of money, build something, it takes six months, maybe two years, and then it doesn't perform, and then you're struggling because you don't have the right engineers or the skills in the market to support something like that.
Adi Polak: Or the company closed down and then there’s no one to talk to about the solution because they went out of business.
Participant 3: As a bank you usually don’t end up in that situation. That’s why we take slow decisions and weigh up options and risks.
Adi Polak: We go through a lot of build-versus-buy conversations and a lot of technologies to choose from. One of the guiding principles is, do we trust this company? How long have they been in the market? Are there success stories? Are we willing to bet on being one of their first customers to try it out and understand the level of risk? Specifically in the agentic AI space, there are new models every day. There are new solutions for prompting, retrieval, and embedding capabilities time and time again. I think the true essence of it, at the end of the day, is that you'll have applications and you'll have a data pipeline. As an industry, I believe we have established which technologies can help us in building applications. If you're a Java shop, you know your Java capabilities.
If you’re a Python shop, you know your Python capabilities and so on. On the data pipelines, you probably have some solution that already exists and works for you for data pipelines. If we can have it very generic that we can do some separations of responsibilities into which embedding we’re called to, changing a function call, it makes it easier for us to operate and replace things later on. It’s still a migration. It still means we might need to move data from place to place and do that operation. If we build the right architecture, it gives us the right flexibility to do it. Probably in the embedding space, in RAG specifically, there’s a lot of new algorithms that continuously evolved. Existing big players are continuously amping their embeddings to make them faster, lower latency, and so on. Foundation models, or other models are improving continuously.
This is something that will require us to change. The infrastructure is going to be the same. We're probably going to need GPUs. We're probably going to need to feed data into the model. We're probably going to need to massage the data before it goes into the model. We're going to need to have monitoring. We're going to need to have governance. The basics of the architecture that we're building are going to stay the same.
As long as we’re having these building blocks set up in place, and we’re building it in a way that is modular so we can easily replace whatever we have for the models or whatever we have for embeddings, I believe we’ll be set up for success in that field. Someone asked me, who’s going to win in the embedding space? I was like, “I don’t know. There are new papers every day. I don’t know who’s going to be the best. Let’s talk again in two years”. Really, it’s very hard to tell. Flexibility, I believe is the way to go.
Fabiane Nardon: The truth is that probably the decisions we are making now are wrong, and we are going to find out in five years, because the technology is too new. The data space is a little more stable, because the things we have to do with data are not going to change that much. Considering agents and AI, probably five years from now the landscape is going to be very different. There's no way around it: you have to start experimenting now to be prepared for what comes up in five years.
Matthias Niehoff: Data is moving fast, but not as fast as GenAI, definitely. For me, it's about separating into building blocks: observability and all the surrounding stuff, and then things like storage, compute, catalog, visualization, there are so many of them, and then having standards and building on top of those standards. Having open standards, especially when it comes to storage, in my opinion, is important, and storage is pretty open. You don't have to pay some database vendor money just to store the data. You can have free formats, and then augment those open solutions with paid solutions. As a bank, you're most of the time making decisions slowly and late, so you have the advantage of knowing what works in the market, what fails, what players are there, which are stable, and this is a pretty good strategy.
Tech Maturity and Skill Set Availability
Participant 3: Any thoughts about skills availability? How do we assess how mature the technology is in terms of available skill sets?
Matthias Niehoff: How would you do this? Several strategies. One thing is doing the work yourself, so building a proof of concept. Will this technology work for your problem? I think this has to fit first, of course. Then going to conferences to see what others are building: are others using this? Also checking the community. Is there a community? Are there people working with this? For instance, if you decide to go with Databricks or Snowflake, or also Confluent, as a platform, I would say, these are so large, there are so many users. It's like, you don't get fired for buying IBM. When it comes to smaller stuff, yes, it's harder. The homework is really getting experience from others, maybe going to user groups, talking with others, if there are some people. That's what I would do.
Sarah Usher: I’ve had more success using the professional services and training the people I already had, rather than trying to find the new skills, because it’s just moving too quickly.
Matthias Niehoff: Upskilling your existing workforce is way more efficient, of course.
Software Engineering Roles/Skills
Participant 4: I want to go back to what you spoke about, that you believe in data engineering, not data engineers. If we look at it and we bring the platform work, the data work, and the modeling work into one role, what does a modern data architecture team look like? What would the other roles be? How would you group them in your different teams? How do you actually grow those people? If we've got people who say, I'm a platform engineer, and someone says, I'm a data engineer, we could say, ok, there's one role that we want to achieve, which is data platform modeling. That's what this person does. What other roles do you have in your teams for a data team? How are you growing these people, so that they bring these personas into one role, so that we have a structure where people work on the different things? Even though I'm not a platform engineer, in my role the persona of a platform engineer is something that I'm required to take on.
Sarah Usher: I think this is a challenging question, because we don’t really have good standardization of roles in software. You can be a staff engineer in one place, you can be a software developer in another place. What’s the difference? It really comes down to what you want to pay someone, and what you want to call them, and how you want to map them in your career structure at the end of the day. That’s what a role is for. It also varies a lot between the size of your company. If you’re going to work at a very large company, you’re going to naturally have more specializations. If you’re going to work at a startup, we all know you have to wear 50 hats. That’s a challenging part. Look at the problems that you’re trying to solve in your business, look at the areas that naturally group together in skills, and call them what you want, data software platform engineer.
Matthias Niehoff: Look at skills, not roles. What skills are actually needed to build a data platform? You mentioned some of them: the modeling part, building pipelines, the platform part. We can learn a lot from platform engineering, which currently focuses a lot on application platforms; you can learn a lot from platform engineering ideas for data platforms as well. Look at skills. I think this is more the thing. If you then separate those, say: you are more into the platform side of data engineering, so you're a platform engineer; you are more into the modeling, so you're a data engineer or a data analyst. It depends on the organization. I've seen all of this in the end. My title is principal consultant or whatever. I don't care. Do what is necessary. The role is just a name, at least in my company. It depends on the company, of course.
Adi Polak: I can give a managerial perspective of how as a manager I see it in my organization. My goal as a manager is to create impact for the organization. Really, the title, the function, whatever, it doesn’t really matter. There are things that need to get done, and we need to get it through, and we need to get it out for our customers. At the end of the day, if someone is a platform engineer, or a data engineer, or an AI engineer, it doesn’t really matter. We have a mission, and we need to get it through.
Each one of the people has different skills, and each one has what I call a level of hunger: the wanting to learn more, the wanting to get things done, the wanting to get their hands dirty as we go along. What happened specifically with AI, when AI started, is that everyone wanted all the tasks related to AI, because it's fun, it's interesting. There's a lot of hunger and desire to grow and learn the skills in that space, and actually get real-world experience in it. What I observed from my teams, essentially, is people finding ways to incorporate AI in everything that we do, and later on, we start learning together about how to optimize, what worked, what didn't work, and so on. I think, especially early in someone's career, being very laser-focused on, I'm very good at this specific technology, is great for the first two or three years.
After that, when you start moving from junior to senior to staff to principal, really what you're delivering at the end of the day is not, I'm super well-versed in Apache Spark. It's actually the business outcome that you were able to create, through the red tape, through the challenges, through working across teams and across departments, and so on. My question to you would be, what's the impact that you want to create for the organization? That's it.
Career and Skill Growth
Participant 4: How have you grown these people so that they are able to then wear the many hats to achieve the objectives?
Adi Polak: Each one of us is responsible for our own career. We have managers, and they're there to support us, but the manager's first responsibility is to create impact for the business. Each person needs to find ways to grow themselves first.
Matthias Niehoff: As a manager, you create impact for the business and you make room and space for the improvement of the individual, so you make it possible. It's about psychological safety. It's ok to fail. You don't have to know all of this. There's so much stuff happening right now. Try things out. Raise your voice, raise your concerns; when you discuss things, bring people into the discussion. If you just have the one guy that knows it all and says, just build this, build this, build this, yes, they're learning and they're doing it, but it's not efficient learning.
Adi Polak: I can buy a course for the team, but if the team won't go through the course and be very detail-oriented, what does it matter? I just spent money and put it out there for people to go to a conference or take an online course. At the end of the day, it's on the people to want to learn and do.
Roles and Architectures of the Future
Participant 5: I’m wondering what will be the new job names in the near future. Also, I would like to talk about the architecture part because we had Matthias talking about data platforms, shifting left the data and application platforms. Then we had Adi talk about the agents and the RAG systems. Do you think it’s going to interconnect and co-exist in the future? Maybe you can elaborate on this.
Matthias Niehoff: I think it’s obvious that your data is making a difference for RAG. You said that precision happens through the data you have. The data is stored in your application. At some point, it has to be connected somewhere. I’m pretty sure the first solutions now, if we’re building quick one-off solutions and just want to get something out with some agent or RAG or something, is somehow connecting it in a dirty way towards the operational system. Just get the data out somewhere. I think this is what is going to happen in most of the cases. In some cases, there’s some integration architecture in place where this can be interconnected. It might be Kafka. It might be something else. It might be whatever. I also hear the term data-centric architectures, oftentimes. I think it’s one thing that is important.
Data is getting more important in all our architectures, in all our systems. It's not like we have an application where things are done, and data is just the byproduct that happens to be there. Data is what's fueling and running those applications. It might be that there's more of a central data platform, that the data platform becomes the center, and all the applications and RAG agents might just be applications connected to that data platform. That might be a thing that could be possible.
At least there are some ideas of doing this. It might be that applications are still in the lead, and it's getting more like a mesh style, really like data mesh style. I think this is the interesting question in data architecture and connecting those systems right now. From my perspective, in software architecture, I would agree: I wouldn't decide between software and data architecture. The reality is, in my opinion, that software architecture takes care of software architecture but doesn't think a lot about data architecture in most companies.
Sarah Usher: This is where when you specialize too much, then you actually have a problem, because you don’t have anyone who’s looking at the overlap and looking at the bigger picture.
The Data Platform Team
Participant 6: I lead a data platform team, and I feel like we're often stuck in the middle between higher-up management and the business versus the app development teams. Because I feel like today in the app teams, data is often seen as a side effect, and their main focus is on the operational plane, whereas the business and the management part of the organization are always like, where are our AI cases? Where are the agents? Where's the cutting-edge technology? As the data platform team, we're often stuck in the middle trying to manage between that. I was wondering if you have any advice on how to deal with it.
Fabiane Nardon: What I think is that more and more, especially with AI and everybody having to do AI applications and agent applications, the application team needs more from the data platform than they needed before. The value of having data in a data lake, or available for doing AI and agents, is becoming more relevant. What I've seen in the teams I manage is that once people realize they need data to do the AI application, everybody is doing it. It's very hard right now to find an application that doesn't touch AI in some way. This need is becoming more relevant. Then exchanging data between teams and having architectures that can solve this problem is going to be important for the application team as well. I think the application team is going to use the data platform more and more in the near future. We have lots of problems to solve in this space.
For example, the latency of the data for serving agents. The data platforms have to evolve as well. That's why we are doing this track, actually: to get all the ideas and try to inspire people building data platforms for what agents and AI applications need. I bet the application team is going to pay attention to the data platform team very soon, because they are going to need to use the data platform to solve their problems as well.
Adi Polak: I think, doubling down on what you said, up until not so long ago we saw data streaming in one direction, upstream to downstream, from applications to the analytics to the data platform itself. Now with agents and AI applications, we understand that these applications need fast access to the data with low latency. Which means that on the data side we need to start thinking about smart indexing, embedding, and better caching mechanisms as well, to expose that data to the application side in a way that makes sense to them, that answers their low-latency requirements and so on.
I don’t know if you have any stream processing or data streaming today or only batch processing, but starting to look into how to add data streaming capabilities is going to be probably the first step, and how to expose that data to the application in a way that it’s secure, governance, there’s some level of quality, there’s access control from a security mechanism as well in place. Because the way I see it, if leadership, the top level is asking for more innovation and everyone is now dealing with that exact situation: application side is trying to figure out which applications to build, where it’s going to be ROI, what to do.
A couple of months from now they're going to come to the data platform team and say, "I need fast access, five minutes is not good enough. I need low latency. How can you expose that data to me in a way that I can actually make use of it?" It's worth exploring and speaking with the application teams to understand what type of agents or AI they're looking to build, so you can start preparing the roadmap on the data side and be ready when their requests come. Because if this is what management wants, it will eventually happen.
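As an illustration of that low-latency serving path, here is a minimal sketch assuming the confluent-kafka Python client, a local broker, and a hypothetical customer-profile-updates topic: a consumer keeps an in-memory view of the latest events so an agent or AI application can read fresh data without querying the batch platform.

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption: local broker for the sketch
    "group.id": "agent-feature-cache",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customer-profile-updates"])  # hypothetical topic name

latest_profile = {}  # customer_id -> most recent profile, served to the agent

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Materialize the stream: keep only the latest state per key,
        # so the application gets a low-latency lookup instead of a batch query.
        latest_profile[event["customer_id"]] = event
finally:
    consumer.close()
```

In practice the in-memory dict would be a proper cache, index, or feature store, but the shape of the solution, stream in, materialize, serve with low latency, is the point being made above.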
Sarah Usher: I think this wave of AI is putting a lot of pressure on data platform teams, but it is also an opportunity to focus eyes on the data platform again. Because it can't just be a dumping ground where stuff happens; it's now really important. Use that as an opportunity. You suddenly exist, data platform. Ok, can I have my stuff faster? You can, but I'm going to take you on a journey first about how your data actually moves through the system, and you're going to work with me, because I'm not just going to magically give you fast data. It's your data. You own it. I just help you as a platform person get it through the system, and here's how it works. Let's work together to get it out. That's how I'm using it as an opportunity to get the product folks to, A, realize we exist, B, accept ownership of data, and, C, work with them to actually build systems that can then serve the business. Because again, it doesn't really matter what you do, that's the mission. It really serves as an opportunity as well.
Adi Polak: It speaks to the data contract in the shift left session.
Data Platforms in Highly Regulated Spaces
Participant 7: More of a practical question from your experience. A lot of the time we talk about data platforms in terms of volume and speed and insights, but in highly regulated spaces we have other challenges. I was just curious about your thoughts and any trends, for when root is untrustworthy and the data analyst is not allowed to see data they haven't explicitly been given permission for. All files are encrypted. We're exploring things like fully homomorphic operations and Secure Enclave compute, but we bump into little problems where the software password is in the config file for the open-source project. Just thoughts or advice or any experience you have when operating in a data space like that, let's say a heavily regulated space.
Matthias Niehoff: I don't have that much experience in this really, where it's about health data for instance or something like this. What I've seen from the financial space, where it's really sensitive data, is that you only get so far with open-source software, because then it's all the enterprise features it gets stuck on, all these data protection features and so on. If you try to build this around or with open source, you most likely end up just building a lot of stuff that you should really just buy.
For me, I would try to solve it as much as possible with the capabilities of the data platform you have, like column masking, row-level security, and all this stuff. I also see that when data is completely encrypted, at some point it's really getting tricky and we really need special solutions for some parts. If it's encrypted on file with a customer-managed key and so on, yes, it's all possible, there are solutions for all the platforms, but it's really getting tricky then. That's just the case from my experience. It's not, "Shiny features. You have these slides here, this is the architecture". Then they say, "Just a moment. I have these special requirements".
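As a rough illustration of solving it with platform capabilities, here is a minimal PySpark sketch, with hypothetical table and column names, that applies column masking and a row-level filter before exposing a view to analysts. The managed platforms mentioned offer this declaratively (masking and row access policies); this is only a hand-rolled equivalent under those assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("masking-sketch").getOrCreate()

# Hypothetical source table containing sensitive columns.
patients = spark.table("raw.patients")

# Column masking: analysts see a salted hash instead of the national id.
masked = patients.withColumn(
    "national_id",
    F.sha2(F.concat(F.lit("static-salt-"), F.col("national_id")), 256),
)

# Row-level security: analysts only see rows for the region they are cleared for.
cleared_region = "EU"  # would normally come from the user's entitlements
restricted = masked.filter(F.col("region") == cleared_region)

# Expose only the restricted view downstream; the raw table stays locked down.
restricted.createOrReplaceTempView("patients_analyst_view")
```

The point Matthias makes still holds: once the files themselves must stay encrypted end to end, view-level tricks like this stop being enough and special solutions are needed.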
Adi Polak: Yes, a similar aspect. The platforms today, if I look at Databricks, Snowflake, Confluent, some of their additional value is in the security, in the encryption. At the end of the day, if you're choosing to use an open-source solution, you will need to take care of the rest of those things yourself, whereas if you can afford to buy an existing solution, some of these capabilities are already going to be built in. You didn't ask about LLMs and things like that, but my advice is always, if you can deploy them in your own cloud, so your message doesn't need to travel and it's close to where the actual processing happens, it's better from a security point of view, because you keep it in your own perimeter, in your own private networking. There are tools out there like Databricks that allow you to deploy your own model. Snowflake does it.
In Confluent, we’re enabling it today next to our Flink offering as well. It reduced a little bit from the security challenges of sending a message out there to the world. Again, it doesn’t exist in open source and most companies either build it or buy it.
Matthias Niehoff: For instance, is it also on-prem as a requirement for you, so not using cloud?
Participant 7: On-premise is not so much a requirement; it is the legal position of the administration. I cannot outsource to someone who I don't legally own. The administrator is often on-prem, but I can put the box somewhere else. It's the person that I have to own.
Matthias Niehoff: That has to be your own. Because I have customers who, for some reason, maybe German reasons, are saying, we only do on-prem, and that's a decision in its own right, but then you can't use SaaS services.
Data Sharing in a Hybrid Data Platform
Participant 8: My question is about a hybrid data platform architecture, when we have, for example, data lakes and data hubs on on-premise infrastructure and on clouds like Azure or AWS. Based on your experience and your point of view, what is the best way to share data between the two worlds, on-premise and cloud: to copy data from on-premise to cloud and the inverse, or to try to expose data through APIs, knowing that we are talking about big data, huge data, where using APIs is more difficult?
Matthias Niehoff: It depends. In the end, you have data transfer costs. If you can split the query in a way that you can really say, ok, this query basically only touches on-prem data, then just execute it on-prem. If you say, I always need the data, and it's really all of the data from different sources, then most likely there are benefits to having it all in one place. Also, data transfer charges and the cloud egress and ingress costs might be a thing. It can be pretty expensive. There isn't one answer. It just depends on the requirements. From a technology perspective, there are enough solutions, but it really depends on what the requirements are.
On-Prem vs. Cloud – The Cost Factor
Participant 9: We moved both the application and the database onto the cloud and found the database taking much more space than the application. Is it still worth doing that, or is there some technology that can reduce the space it's taking in the cloud? These are cost issues. Eighty percent of our cloud space is the database, only 20% is the application. Is there any technology that can help with that?
Adi Polak: It really depends what type of database it is. Can you share a little bit more, perhaps, about the problem?
Participant 9: At the moment, it’s not that much BI. It’s just an operational database, an OLTP database. It’s just like we are hosting our application for our client, and then we have to buy a lot of space from AWS, and 80% is for the database.
Adi Polak: When you say space, it’s S3, or?
Participant 9: I don’t know about that.
Adi Polak: VMs? Where is the spend?
Matthias Niehoff: If you move this database from on-prem to the cloud and just run the database as it was on-prem, in the same way in the cloud, it definitely will cost more. You will definitely have to architect for the cloud. It should be cheaper, actually, than what you had before. If you're using the Aurora database, for example, and using all the advanced features and have every box ticked, yes, it could be more expensive. Cloud doesn't make it, in general, cheaper, but it doesn't make it, in general, more expensive either.
Participant 9: It’s probably a lot more expensive. At the moment, it’s like the whole database on-premise goes for cloud. That makes it really expensive.
Adi Polak: It’s a little bit about going into the weeds of AWS and services and so on, but sometimes you’ll have different tiers for storage: cold, hot storage, the Iceberg table buckets and so on. Each one of those has their own pricing model. Some of the things that you might have been used to doing on-prem, it’s like getting the fastest, having hot data, having everything in Iceberg format, for example, might cost you more on the cloud because the pricing system works along the tiering. Then it’s a question, it’s like, how do you architect it for the cloud to only use the specific things that you need there? Because when you work on-prem, you already bought the hardware. You have it. Now you just need to maintain it. When you’re moving to the cloud, the pricing model completely changed, and it requires understanding a little bit about how it works, what’s the cost of ingress, outgress, the private networking and so on. It’s a different ballgame.
Sarah Usher: The way you compare the costs between on-prem and cloud is quite different. Just doing a one-to-one like that, I think you’re always going to see different numbers. Maybe look at a different boundary as to what to compare.
Participant 9: Internally, who should I speak to? Which role should I speak to: data architecture, application architecture, system architecture, or data engineering?
Adi Polak: About the cost specifically?
Participant 9: Yes.
Fabiane Nardon: A lot of the work in data engineering is managing cost. Probably 80% of the work is how to make things cheaper.
