Transcript
Sudai: We’re going to get a better understanding of security posture management as an actual technology stack. We want to understand how to identify breaches and learn about threats from our misconfigurations. Let’s start with a quick temperature check. Who here is a software engineer? Who has ever worked in security? Who has ever heard the term cloud security posture management? My role today is security operations manager at Deliveroo. I have an amazing team that’s focused on identifying real-time threats in cloud environments, which is slightly different from this topic.
Over the last decade, I’ve been working in DevSecOps; my last hands-on technical role was as a DevSecOps engineer. As for the industries I’ve worked in: I was in the intelligence unit in the IDF, then in the software world, then in financial services. Today I’m in fast food delivery. I’m also a mentor and a public speaker.
First of all, we’ll go through the overview, the learning objectives, and what we’re going to cover. The second, third, and fourth things are the why, the what, and the how. For the why, we’re going to have a small history lesson about what happened over the last two decades in the cloud infrastructure world. Then we are going to review the five main breaches that happened in the industry because of misconfigurations in the cloud.
The third thing, we are going to focus on some basic terms. Then we’re going to understand what CSPM actually is and go through its life cycle. We are going to learn what the right questions are for us as software engineers when we’re trying to configure detection from a security perspective. Then, the how: we’re going to practice. We are going to walk through a CSPM flow from the developer’s perspective, all the way from the alerts that the security analyst gets and mainly complains to you about. Then we’re going to go through some best practices and how we measure them. We’re going to explore the latest cloud solutions available today. Then, a quick summary.
Overview
We are going to understand what security posture management is. We’re also going to touch, measure, and analyze risk. We’re going to analyze APIs to identify misconfigurations. We are going to create use cases together. We are going to focus on the actual practices and evaluate the solutions that we have.
Why?
First of all, why? Why are we here? Why are we doing this? Up until 2005, physical infrastructure required manual maintenance. Technicians, mostly IT, took a long time to fix it, anywhere from days to weeks. Today, with the growth of cloud infrastructure, we are going to focus on how small misconfigurations are made. Infrastructure is now developed with a line of code. This is very easy, but very sensitive. We are able to damage the security posture of our infrastructure, and sometimes we don’t even know it. Think about how easy it is to instantiate a data lake, a Lambda function, or even a full service, without realizing it. Here are some of the big cloud security breaches of the last five years.
Real Estate Wealth Network leaked 1.5 billion records in 2023. Around 1 terabyte of data was exfiltrated because of a small misconfiguration in the cloud. Names, phone numbers, addresses, mortgage information, and so on were exfiltrated from the company databases. Kylie Jenner and Britney Spears were also impacted by this. Maybe the best known and most memorable one for me: Capital One was fined $80 million over its 2019 breach. Back in the day, that was my role: I was making sure that HSBC didn’t experience the same. The famous hacker, her name is Paige, was bragging on Twitter that she had leveraged a vulnerability within a bucket and was able to breach the company’s information.
Who has ever heard of those terms? Who has actually ever focused on fixing them? First of all, what is a vulnerability? A vulnerability is a weak point in our system. It’s basically the loophole that we are leaving in our infrastructure. It’s a flaw within our system that we weren’t able to identify. There are a lot of those. Luckily, and that’s the good news, between 2% and 5% are actually able to be managed.
Beyond that, there are a lot of things that we can actually prevent ourselves. What is a threat? A threat is a malicious or negative event that actually leverages this vulnerability, so the vulnerability no longer stands on its own. What are we thinking about here? We have a vulnerability: something that’s been left open in our system. We have a threat that actually exists there. That can be malicious software that’s leveraging the vulnerability. That can be a social engineering email that was able to compromise part of our infrastructure. Last but not least, what is risk? Risk is what our company assesses across our main environments, the cloud and on-premise if we have it: what is the potential to lose our crown jewels? What is the potential to lose this data? That’s how we, the security teams, are able to identify that and align it with you guys.
What?
What? CSPM is basically a technology that uses the metadata of our cloud resources to identify misconfigurations, and through them the potential threats that lead to actual breaches. I’m going to take you through a CSPM life cycle, based on what I developed back in the day. I took Forseti, an open-source product from Google, and put it in another infrastructure based on the bank’s needs. It’s not a big thing to have, and it doesn’t take a genius to create it. It’s very simple.
First of all, we have the inventory. The inventory is all of the information that we have about our infrastructure: what projects or accounts we have, what IAM roles we have, what services we have, what policies we have. Think of any attribute of the cloud infrastructure. Let’s say my mom always tells me, leave the door locked. That’s a fact. That’s my information. My door is in my inventory. The scanner takes all of the changes that happen on the API, compares them with the existing data, and identifies what has happened now. Is there now an IP address that is open to the internet? Is there now a role that is super permissive? When we’re scanning that over time, we are able to see changes.
For example, before I go to sleep, if someone were to scan my door, they would figure out that it was locked before and it’s not locked now. Then we have the explain part, which is more of the technical part, and that’s what I changed. We take what has been scanned, go back to the inventory, and work out, based on the risks that the security team defined for us, which measurements help us identify the security threats. Then we notify. Notify can be the alert that you’re getting in Slack. It can be the emails that someone’s been getting. It can be our report for the headcount and the executives. That’s the way we take those things and bring them back to you in order to fix them. Then we have a part that is not mandatory at all, and it’s slightly risky. It’s called enforce. Enforce basically allows us to, for example, set a desired template for how we want the situation to be changed.
For example, if I had an automated lock on my door, and I went to sleep without knowing, it would figure out that the door is not currently locked and lock it for me. What’s the danger in that? Probably not much for my door lock. But let’s say there was a public-facing bucket that was accessed again and again, and privileges were changed accordingly. There is a risk that if we’re enforcing, we are missing something here: let’s say there is someone, a malicious insider, who keeps changing the role.
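As a rough illustration of the life cycle described above (inventory, scanner, explain, notify, enforce), here is a minimal Python sketch built around the door example. Every name in it (`Finding`, `scan`, `explain`, `notify`, `enforce`, the `locked` attribute) is hypothetical and chosen for illustration; it is not how Forseti or any commercial CSPM product is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    resource: str
    attribute: str
    before: object
    after: object


def scan(inventory: dict, current: dict) -> list:
    """Scanner: compare the stored inventory with the current state."""
    findings = []
    for name, attrs in current.items():
        known = inventory.get(name, {})
        for key, value in attrs.items():
            if known.get(key) != value:
                findings.append(Finding(name, key, known.get(key), value))
    return findings


def explain(findings: list, risky_values: dict) -> list:
    """Explain: keep only the changes the security team defined as risky."""
    return [
        f for f in findings
        if f.attribute in risky_values and risky_values[f.attribute] == f.after
    ]


def notify(violations: list) -> list:
    """Notify: build messages; a real setup might post them to Slack or email."""
    return [
        f"{v.resource}: {v.attribute} changed from {v.before!r} to {v.after!r}"
        for v in violations
    ]


def enforce(current: dict, violations: list, desired: dict) -> dict:
    """Enforce (optional and risky): reset violating attributes to the template."""
    fixed = {name: dict(attrs) for name, attrs in current.items()}
    for v in violations:
        fixed[v.resource][v.attribute] = desired[v.attribute]
    return fixed


# Hypothetical data: the door was locked in the inventory, and is unlocked now.
inventory = {"front-door": {"locked": True}}
current = {"front-door": {"locked": False}}
risky_values = {"locked": False}   # what the security team considers risky
desired = {"locked": True}         # the desired template for enforcement

violations = explain(scan(inventory, current), risky_values)
messages = notify(violations)
enforced = enforce(current, violations, desired)
```

Note that `enforce` returns a corrected copy rather than changing `current` in place, which keeps the risky step explicit and easy to leave out.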
Asking the right questions. I’m not expecting from you guys to basically develop those things. If you would want, we will first focus on what, where, and when. What is the data that we need here? What is the data that we are looking for? Are we looking to get information about our data lakes? Are we looking to get information about our IAM roles, about our projects, about our accounts, about our billing? Then, what will be detected? Basically, what we are looking to detect here. Whether the buckets are not encrypted. Whether it’s the IAM roles are too permissive, and what standards we should follow.
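To make the "what will be detected" question concrete, here is a minimal Python sketch of two detection rules: unencrypted buckets and overly permissive IAM roles. The resource shape (`type`, `encrypted`, `statements`) is a simplified, hypothetical format invented for this example, not the schema of any cloud provider, and the definition of "too permissive" (wildcard actions on wildcard resources) is only one possible standard to agree on with the security team.

```python
def unencrypted_bucket(resource: dict) -> bool:
    """True when a bucket is not marked as encrypted."""
    return resource.get("type") == "bucket" and not resource.get("encrypted", False)


def overly_permissive_role(resource: dict) -> bool:
    """True when a role allows every action on every resource."""
    if resource.get("type") != "iam_role":
        return False
    return any(
        stmt.get("effect") == "Allow"
        and "*" in stmt.get("actions", [])
        and "*" in stmt.get("resources", [])
        for stmt in resource.get("statements", [])
    )


# Rule identifiers are hypothetical labels, not taken from any product.
RULES = {
    "bucket-not-encrypted": unencrypted_bucket,
    "role-too-permissive": overly_permissive_role,
}


def detect(resources: list) -> list:
    """Return (resource name, rule id) pairs for every rule that matches."""
    return [
        (res["name"], rule_id)
        for res in resources
        for rule_id, rule in RULES.items()
        if rule(res)
    ]


resources = [
    {"name": "orders-data-lake", "type": "bucket", "encrypted": False},
    {"name": "logs", "type": "bucket", "encrypted": True},
    {"name": "ci-deployer", "type": "iam_role",
     "statements": [{"effect": "Allow", "actions": ["*"], "resources": ["*"]}]},
    {"name": "reader", "type": "iam_role",
     "statements": [{"effect": "Allow", "actions": ["storage:GetObject"],
                     "resources": ["orders-data-lake"]}]},
]

detected = detect(resources)
```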
Basically, we need to go back to the security team and ask them, are we able to basically prevent it this way? What is your desire? Maybe something that’s important to the security team, and not as important to us. We need to bridge over this. We need to make sure that we are aligning our requirements with the security team requirements. Eventually, me as a security manager, I don’t want to block you. I don’t want to make sure that you would not be able to work.
Moreover, I don’t want to get to the point where I am dismissing live services or taking them down because of security concerns. Then the where: where all of this information exists, and where in the metadata we’re looking. Which credentials, for example, can be stored, and where? Let’s say, for example, that the keys are access keys. Those can be in our GitLab environments. Those can be in an HSM. Those can be in an external safe. Where is the sensitive data stored? It might be that we say, let’s protect all of the production environments, but the information actually exists in some staging environment. The when: when I start scanning is important, because it determines how much I’m exposing myself to risk. When security testing takes place: for example, based on the last pen test, what am I going to review here? When should we define a misconfiguration as an incident? I’m taking you to my role today. Let’s say a bucket wasn’t encrypted: it’s fine. It’s just a concern. There are vulnerabilities that would be able to get information from this bucket.
Eventually, I want to know when the bucket is misconfigured. When someone was already doing a cp command, or if there’s already a vulnerability that’s currently impacting the market, or triggers something for companies like Deliveroo. How often the scanning will take place, what would be the duration? Remember the former step of the scanner? Basically, on the scanner itself, we need to measure. Are we measuring every hour? Are we measuring every five minutes? What is the most effective thing in order to fix those misconfigurations? It all requires collaboration. As I said, I don’t expect you to do all of this. There are IT professionals who are setting those environments. There is a senior management that needs to lead us about what we are going to do next. There are the security teams, such as me, that give you the guidance. There are the software engineers, which is you, which help us.
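The "when" questions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the three severity levels, the `classify` function, and the one-API-read-per-resource-per-scan cost model are all hypothetical, invented here to show the reasoning, not taken from any product.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    CONCERN = "concern"
    INCIDENT = "incident"


def classify(misconfigured: bool, data_copied: bool, known_exploit: bool) -> Severity:
    """A misconfiguration alone is a concern; it becomes an incident once
    data is already being copied or a known exploit is in circulation."""
    if not misconfigured:
        return Severity.OK
    if data_copied or known_exploit:
        return Severity.INCIDENT
    return Severity.CONCERN


def scan_tradeoff(interval_minutes: int, resource_count: int) -> tuple:
    """Return (worst-case detection delay in minutes, API reads per day),
    assuming one API read per resource per scan."""
    scans_per_day = (24 * 60) // interval_minutes
    return interval_minutes, scans_per_day * resource_count


hourly = scan_tradeoff(60, 500)
every_five_minutes = scan_tradeoff(5, 500)
```

With 500 resources, scanning every five minutes instead of every hour cuts the worst-case detection delay from 60 minutes to 5, at the cost of twelve times as many API reads.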
How?
The how: let’s take a look at a CSPM alert. It’s about a publicly exposed VM or serverless function with a high or critical severity network vulnerability that has a known exploit, and with access to sensitive data. It sounds big. Who has ever gotten this alert? No one, because you’re probably getting it in a Slack channel that you’ve hidden. You’re not able to take a look at it. It sounds too big. It sounds too concerning. Most likely, in my company, my SOC team will take a look at it, follow a playbook, and fix it themselves. If we deep dive, can you see the Lambda function over here? There is an IAM role that connects to several buckets. Several buckets are communicating with this Lambda function. It’s still connecting to the Lambda function.
Then we have two application endpoints that connect it to the internet with ingress traffic. Also, this scanner is able to identify three CVEs. A CVE is basically a standardized public record of what a known vulnerability looks like. It tells me there are things that might exploit this Lambda function. Let’s say I’ve now clicked on this alert and seen it in an existing CSPM solution. After I’ve seen the security graph, if I take a look, I see those two API gateways with any/any access from an ingress perspective, communicating on port 443 with this Lambda function. You can now jump into the console. If you were to triage that, you would probably go to the console. I don’t always have all of the information here.
Sometimes there is information such as the security group or the traffic, a lot of things that I already mentioned. How would I fix it if I want to do that automatically? Or, what if I want to get more information about that? How was this tool able to identify that? That’s the API. Remember that we went to the inventory and scanner, that’s what the CSPM presents to me. That’s the information it takes and it scans on a periodic time basis, and the rule that was set, the VM and serverless with a vulnerability and access to this data was looking at. First of all, it was looking at the Lambda function. This is not the actual API scanning. We can see here, for example, that if it was any/any on the source IP, that was something that was changed across time, if the IP address of the API endpoint was configured differently.
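The alert walked through above combines several conditions on one resource. A minimal Python sketch of such a compound rule might look like the following. The resource dictionary, its field names, and the CVE identifier are all hypothetical placeholders for this example; a real CSPM product builds this picture from cloud APIs and its own graph.

```python
import ipaddress


def is_open_to_internet(ingress_rules: list) -> bool:
    """True when any ingress rule allows every source address (any/any)."""
    return any(
        ipaddress.ip_network(rule["source"]).prefixlen == 0
        for rule in ingress_rules
    )


def has_exploitable_vulnerability(cves: list) -> bool:
    """True when a high or critical CVE with a known exploit is present."""
    return any(
        cve["severity"] in {"high", "critical"} and cve["known_exploit"]
        for cve in cves
    )


def reaches_sensitive_data(data_stores: list) -> bool:
    """True when the resource can reach a store flagged as sensitive."""
    return any(store["sensitive"] for store in data_stores)


def compound_alert(resource: dict) -> bool:
    """Fire only when all of the conditions hold for the same resource."""
    return (
        resource["kind"] in {"vm", "serverless"}
        and is_open_to_internet(resource["ingress"])
        and has_exploitable_vulnerability(resource["cves"])
        and reaches_sensitive_data(resource["data_stores"])
    )


# Hypothetical Lambda-like function mirroring the example in the talk.
lambda_function = {
    "kind": "serverless",
    "ingress": [{"source": "0.0.0.0/0", "port": 443}],
    "cves": [{"id": "CVE-0000-0001", "severity": "critical", "known_exploit": True}],
    "data_stores": [{"name": "customer-bucket", "sensitive": True}],
}

# Same function, but reachable only from a private range.
internal_function = {**lambda_function,
                     "ingress": [{"source": "10.0.0.0/16", "port": 443}]}

fires_for_public = compound_alert(lambda_function)
fires_for_internal = compound_alert(internal_function)
```

Requiring all conditions at once is what keeps such an alert rare: each condition alone would match many resources, while the combination points at the few that matter.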
After we went through all of that, we want to measure the following misconfigurations, correct? How do we measure them? First of all, we have a few frameworks that we want to focus on. We want to make sure that we’re able to look at that. Let’s say this alert that I’ve just demonstrated. How was this alert identified? How they were able to say, this is a security concern. There can be several things, several frameworks that help us. For example, from a real-time perspective, we have MITRE ATT&CK, with a lot of different techniques that attacker and malicious entity can leverage. If there is actually a vulnerability within our environment, there are additional frameworks.
For example, GDPR is more of a regulation from a compliance perspective over our infrastructure, and tells us what can and cannot be done from a regional and country perspective. There is also NIST, which basically gives us a security framework set by the National Institute of Standards and Technology in the U.S. Those frameworks are just giving us guidelines. They don’t necessarily know what’s right for us. Remember I said to ask questions and collaborate? An apple in one company is not the same apple in another company. For example, in your company, there might be different concerns than the ones that exist at Deliveroo.
For example, when I was working at the bank, the data was the most important thing: the data, the money. Today at Deliveroo, the data is also important, but we are a relatively newly listed company, so the brand and the stock are important as well. It’s always good to take a look at CSP best practices, pen testing, and CSPM rule sets. What does that mean? Those frameworks don’t have to be the only layout that we use. We can also focus on pen testing, calling in an external red team that would identify whether things are exploitable. We can go to the guidance that our cloud provider gives us. We can also look at the CSPM rule sets of those products (I’ve shown Wiz, for example), how those rule sets work, and what works for us and what doesn’t. That can be debated within our company.
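The point that the same finding matters differently in different companies can be sketched as a simple weighted score. This is a minimal Python illustration only: the impact values, the two company profiles, and the weights are made-up numbers, and real prioritization would come out of the discussion with the security team rather than from a formula like this.

```python
def score(finding: dict, weights: dict) -> float:
    """Weighted sum of a finding's impact on each thing the company cares about."""
    return sum(weights.get(dim, 0.0) * value
               for dim, value in finding["impact"].items())


def prioritize(findings: list, weights: dict) -> list:
    """Return finding ids ordered from highest to lowest weighted score."""
    ranked = sorted(findings, key=lambda f: score(f, weights), reverse=True)
    return [f["id"] for f in ranked]


# Hypothetical findings with made-up impact estimates between 0 and 1.
findings = [
    {"id": "public-bucket-with-customer-data",
     "impact": {"data": 0.9, "brand": 0.4}},
    {"id": "defaceable-public-frontend",
     "impact": {"data": 0.2, "brand": 0.9}},
]

# Hypothetical profiles: a bank weights data most, a newly listed company
# also weights brand heavily.
bank_weights = {"data": 1.0, "brand": 0.3}
newly_listed_weights = {"data": 0.6, "brand": 1.0}

bank_order = prioritize(findings, bank_weights)
newly_listed_order = prioritize(findings, newly_listed_weights)
```

The same two findings come out in opposite order for the two profiles, which is the practical meaning of frameworks being guidelines rather than answers.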
The three main cloud solutions that exist today. Why do I show that to you? There are a lot of SaaS solutions: there is Wiz, there is Orca, there is Prisma, there are a lot of CSPM solutions as of today. You don’t always need them, I believe; it might be that you’re working for a small company. You could leverage what already exists in the cloud infrastructure. If you’re working on Google Cloud, you have Security Command Center. If you’re working on Azure, you have Azure Security Center (now called Microsoft Defender for Cloud). If you work on AWS, you have AWS Security Hub. You don’t have to enable all of the things; you need to enable what works for you.
Summary
What does all of that give us from a security perspective? First of all, it gives us visibility. It lets us know what exists in our cloud infrastructure way better, not just how our solution works, or our code and application are in place, but how is the layout of our environment, and where a vulnerability exists in order to find those small threats and prevent them in practice. The second thing is misconfiguration monitoring. We want to see how our data over time changes, and make sure that we are identifying potential root causes for an ongoing incident that we have.
For example, if I’m always getting a DDoS attack on one of my frontend sites, I want to find, what is the root cause? I want to find where it exists and fix it quickly. Also, you remember the enforcement part? That’s something that can be enforced. If I’m identifying the scope and the range of those solutions, I am able to prevent that for the long-term. I can configure that on my WAF, my Web Application Firewall, whichever solution that you’re using today, and make sure that those malicious entities, they’re trying to take my frontend down, are no longer doing that.
The third thing is incident response. It’s basically alerting on a misconfiguration in order to remediate actual breaches. Let’s say that you have just configured your application, you clicked next, next, next, and something happened. You have already forgotten about it, and it’s not necessarily your fault, but it might be that a malicious entity was able to leverage this vulnerability and is now copying all of the data from this bucket or from this server in your applications. It gets to us, the security operations and response team, and we start to triage it. If we didn’t have this data, if we didn’t have this security graph, if we were not able to take a look at that, we would not be able to solve it effectively. That’s why it’s so important from our protection perspective.
Then we have DevOps integration. One of the benefits is the ability to create automated templates that are in place and automatically remediate threats when they are identified. For example, let’s say that as of today I already know that I have infrastructure that is problematic, where there were many security breaches. I can work with you guys, if you are DevOps guys, and remediate threats when they are identified. I can, for example, ask you: can you configure a template for AWS buckets, or for an IAM role, so that whenever a change is made and a role becomes too permissive, it changes it back accordingly? Or, from a DevOps integration and security perspective, I can say that I want to make sure, once I put those roles in place, that no one is able to get into my infrastructure and change it.
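A template of this kind can be sketched as a desired configuration plus a function that reports, and optionally corrects, any drift from it. The following Python is a minimal sketch: `BUCKET_TEMPLATE`, its two settings, and the `remediate` helper are hypothetical names for illustration, and a real setup would express the template in whatever infrastructure-as-code tooling the team already uses.

```python
# Hypothetical desired template for buckets.
BUCKET_TEMPLATE = {"encrypted": True, "public_access": False}


def drift(actual: dict, template: dict) -> dict:
    """Return {setting: (actual value, desired value)} for every mismatch."""
    return {
        key: (actual.get(key), want)
        for key, want in template.items()
        if actual.get(key) != want
    }


def remediate(actual: dict, template: dict, dry_run: bool = True) -> tuple:
    """Return (resulting config, changes). With dry_run=True nothing is
    changed, so the team can review the plan before enforcing it."""
    changes = drift(actual, template)
    result = dict(actual)
    if not dry_run:
        for key, (_, want) in changes.items():
            result[key] = want
    return result, changes


bucket = {"name": "uploads", "encrypted": False, "public_access": False}

planned, plan = remediate(bucket, BUCKET_TEMPLATE)                  # report only
applied, applied_changes = remediate(bucket, BUCKET_TEMPLATE, dry_run=False)
```

Defaulting to a dry run reflects the caution about enforcement: the change is only applied once someone has decided that automatic correction is safe for that resource.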
Then there is, let’s say, the boring part for you, the thing that we have to do and that is good to do: it’s good for our risk assessment. It helps us identify the things that are concerns from a security perspective. We may have auditors looking at our companies. We may have a security risk management team that’s going to bug us and ask whether all of our buckets are encrypted and all of our IAM roles are in order. We need to make sure that this is more efficient and eventually doesn’t take a lot of time from you guys. Let’s say you’re currently working on the most amazing project of your life, and now someone from audit tells you, I want you to show me all of this infrastructure. That’s a lot.
If we will have this CSPM solution in place, I would be able to take all of this information to give the actual stats to the risk team or to the audit team and show them, this is the amount of buckets that are misconfigured. These are the IAM roles that are currently too permissive. These are the EC2 instances that currently are not encrypted as well. These are the data lakes that can be communicating with the internet without even knowing. That’s how we measure basically how and what protection we need to have in place.
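The kind of statistics handed to a risk or audit team can be sketched as a small aggregation over findings. This Python example is a minimal sketch with made-up numbers: the inventory counts, the finding records, and the `audit_summary` function are hypothetical and only show the shape of such a report.

```python
from collections import Counter


def audit_summary(inventory_counts: dict, findings: list) -> dict:
    """Per resource type: total count, distinct misconfigured resources,
    and the misconfigured share rounded to two decimals."""
    distinct = {(f["type"], f["resource"]) for f in findings}
    misconfigured = Counter(res_type for res_type, _ in distinct)
    return {
        res_type: {
            "total": total,
            "misconfigured": misconfigured.get(res_type, 0),
            "share": round(misconfigured.get(res_type, 0) / total, 2) if total else 0.0,
        }
        for res_type, total in inventory_counts.items()
    }


# Hypothetical inventory and findings.
inventory_counts = {"bucket": 40, "iam_role": 25, "vm": 10}
findings = [
    {"type": "bucket", "resource": "uploads", "rule": "not-encrypted"},
    {"type": "bucket", "resource": "uploads", "rule": "public-access"},
    {"type": "bucket", "resource": "exports", "rule": "not-encrypted"},
    {"type": "iam_role", "resource": "ci-deployer", "rule": "too-permissive"},
]

summary = audit_summary(inventory_counts, findings)
```

Counting distinct resources rather than raw findings keeps a bucket with two problems from being reported as two misconfigured buckets.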
Remember I told you that it’s not your fault and it’s not something that you have to be responsible for, but it will help you. Eventually, what I’m trying to say here is that it’s all really in your hands. It doesn’t matter if you’re directly or indirectly configuring things with the security team, with the IT teams, or communicating with the senior management. You have to make sure that you are operating securely in the first place, and that might require a minute or a second of your time to be aware, to make sure that you’re actually putting that in place.
If you’re already involved in that, that’s amazing. We need to make sure that we have that in place in order to put our org in a more secure position. That will have a massive impact on your work. That’s super important. So next time you’re configuring a service and you’re like, “I don’t know. Have I just clicked through all of the things? Have I just spun up an environment for testing and left it running?”, remember that eventually it’s something that can harm your data. It’s something that can lead to the next breach. Just get into this infrastructure. Just review it again. Go to those alerts that the security team is bugging you about. It’s important. First of all, so they don’t bother you again. Second, in order to get more information for you and for them, and to explore that better. There are additional resources.
Questions and Answers
Losio: You mentioned Slack, and you might probably have those alerts. When I think about cloud security and it’s in our hands to keep it secure, how do you keep your team or more so the team of software engineers on the topic in the long term? I get it, to work on it. I get the idea to think about it, but then I become a bit complacent. It’s like after maybe the first week, I look at those alerts, the second week, yes. After three months, the entire team that are not in the security space tend to, “Let’s archive them. I have too many emails in my inbox or I have too many notifications, let’s archive that”. How do you keep someone in the long term?
Sudai: The collaboration and the fixing? There are a few methods for that. It can be that we, the security engineering team, put enforcement in place and tell you: by tomorrow, if that’s not fixed, we are enforcing, and that’s it, those machines are down, those IAM roles are down, or everything is changed automatically, just to get you to run and do it. That’s not the right approach. That’s the scary approach. We really want to build a bridge between those worlds. My most effective collaborations with engineering teams were the ones where I didn’t put myself as the main thing, as the main reason. I sat with them, I communicated with them, and I tried to understand: what are the most important things to you? Why do you need this application to run with public-facing exposure? Why do you need those IAM roles to be configured this way? Then I would get it. Then we would work together. Because I would explain, as I have explained to you, that if you put this configuration in place, you will have many more problems later, and that can lead to X, to Y, to Z. My focus here was more on basic security terms, but it can go deeper.
If we can bring that to you with collaboration and help you understand it, and even leverage you as engineers to create code-based solutions, to create templates and gold images and baselines that can be effective for the long term, that’s something that sustains. That’s what creates a better org culture, more than anything. If the next time the security engineer pings you and says you have to fix it, and you’re like, I don’t want to, you’re better off doing this collaboration. You’re better off understanding why they want that. It’s sometimes worth putting that first, because if you fix that once, if you put an automated model in place once, it might change your infrastructure for the long term from a security perspective.