Transcript
Stern: We’re going to talk about topologies for cost-saving autoscaling. Just to get you prepared, I’m not going to show you, this is how you’re going to autoscale your environment, but rather ways to think about autoscaling, the pitfalls, and the parts of OpenSearch’s architecture that limit autoscaling in reality. I’m going to start by talking about storing objects, actual physical objects: ice-core samples. Ice-core samples are cylinders drilled from ice sheets or glaciers, and they provide a record of Earth’s climate and environment.
The interesting thing, I believe, and relevant to us in these ice-core samples is that when they arrive at the storage facility, they are parsed. If you think about it, this is probably the most columnar data of any columnar data that we have. It’s a literal column. It’s sorted by timestamp. You have the new ice at the top and the old ice at the bottom. The way the scientific community has decided to parse this data is in the middle of the slide. It’s very clear to them that this is how they want it parsed. This person managing the storage facility is going to parse the data that way, all of it. Because the scientific community has a very narrow span of inquiry regarding this type of data, it is easy to store it. It is easy to make it very compact. You can see the storage facility is boring. It’s shelves. Everything is condensed. A visitor arriving at the facility has an easy time looking for things. It’s very well-sorted and structured.
If we take a hypothetical example of lots of visitors coming, and the person here who is managing the storage facility wants to scale out, he wants to be able to accommodate many more visitors at a time. What he’ll do is he’ll take all those ice-core samples, and cut them in half. That’s just time divided by two. That’s easy. Add a room. Put all these halves in another room. You can spread out the load. The read load will be spread out. It really makes things easy. It’s practically easy to think about how you’d scale out such a facility. Let’s talk about a different object storage facility, like a museum, where we don’t really know what kind of samples are coming in. If you have a new sample coming in, it could be a statue, it could be an archaeological artifact, it could be a postmodern sculpture of the Kraken or dinosaur bones. How do we index these things in a way that they’re easy to search? It’s very hard.
One of the things that’s interesting is that a visitor at a museum has such a wide span of inquiry. What are they going to ask you? Or the person managing the museum, how does he index things so that they’re easily queryable? What if someone wants the top-k objects in the museum? He’d need to rank those too, but they’re from completely different fields. When your objects are unstructured, it’s very hard to store them in a way that is scalable. If we wanted to scale our museum for this hypothetical situation where many visitors are coming, it’s hard to do. Would we have to take half of this dinosaur skeleton and put it in another room? Would we take samples from each exhibit and make a smaller museum on the side? How do you do this? In some real-world cases, there are a lot of visitors who want to see a specific art piece, and it’s hard. How do you scale the Mona Lisa? You cannot. It’s just there, and everybody is going to wait in line and complain about it later.
OpenSearch is similar. You can scale it: that’s adding nodes. It’s a mechanical thing. You’re just going to add some machines. But spreading the load when your data is unstructured is difficult. There’s no straightforward answer. This is why in this particular type of system, and in Elasticsearch as well, you don’t have autoscaling native to the software.
Background
I’m Amitai Stern. I’m a member of the Technical Steering Committee of the OpenSearch Software Foundation, which leads the OpenSearch project. I’m a tech lead at Logz.io, where I manage the telemetry storage team. We manage petabytes of logs and traces and many terabytes of monitoring and metrics data for customers. Our metrics are on Thanos clusters, and everything else I mentioned, logs and traces, is stored on OpenSearch clusters.
What Is OpenSearch?
What is OpenSearch? OpenSearch is a fork of Elasticsearch. It’s very similar. It’s been a fork for the last three years, and the divergence is not too great, so if you’re familiar with Elasticsearch, this is very much relevant as well. OpenSearch is used to bring order to unstructured data at scale; it’s the last line over here. Elasticsearch used to provide an open-source version, and then later they stopped doing that. OpenSearch is a fork that was primarily driven by AWS, and today it has been donated to the Linux Foundation. It’s totally out of their hands at this point.
OpenSearch Cluster Architecture
OpenSearch clusters are monolithic applications. You could run one on a single node. From here on in the talk, this rounded rectangle will represent a node. A node in OpenSearch can have many roles. You can have just one node and it’ll act as its own little cluster, but you could also have many, and they’ll interact together. That’s what monolithic applications are like. Usually in the wild, we’ll see clusters divided into three different groups of these roles. The first one is the cluster manager. The cluster manager nodes manage the cluster state: where indexes are, creating and deleting indexes. Then there are coordinating nodes, which are in charge of the HTTP requests; they’re like the load balancer for the cluster. Then there are the data nodes. This is the part that we’re going to want to scale, because that’s normally where the data is. The data is stored within a construct called an index. This index contains both the data and the inverted index that makes search fast and efficient.
These indices are split up, divided between the data nodes, in a construct called a shard. Shards go from 0 to N. A shard is in fact a Lucene index, just to make things a little bit confusing, since we already use the term index at the cluster level. You don’t need to remember that. They’re like little Lucene databases. On the data nodes, there are three types of pressure if you’re managing a cluster. You have the read pressure, all the requests coming in to pull data out as efficiently and quickly as possible, and the write pressure of all these documents coming in. Then there’s a third pressure when you’re managing a cluster, which is the financial one, since if your reads and writes are fairly low, you’ll get a question from your management or from the CFO like, what’s going on? These things cost a lot of money: all this disk space, all the memory, and CPU cores. It’s three types of pressure.
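If you want to see this layout on a running cluster, the cat APIs expose it. Here is a minimal sketch using Python’s requests library, assuming an unsecured cluster reachable at localhost:9200 (adjust host and auth for your own setup):

```python
import requests

BASE = "http://localhost:9200"

# Lists every node with its roles (cluster manager, data, coordinating-only, ...).
print(requests.get(f"{BASE}/_cat/nodes?v").text)

# Lists every shard and the data node it currently lives on.
print(requests.get(f"{BASE}/_cat/shards?v").text)
```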
Why Autoscale?
Let’s move on to why you would even autoscale. Financially, a cluster costs a lot of money, so we want to reduce the number of nodes that we have. What if we just had enough to handle the scale? This blue line is the load. The red line is the load that we can accommodate with the current configuration; I’ll leave it vague that way. Over-provisioned is blue, and under-provisioned is the red over there. If we said the max load is going to be x, we could just provision for that and have that many nodes. The problem is that we’re wasting money. This is fine in some cases if you have the money to spend, but normally we’re going to try and reduce that. So you opt for manual scaling. Manual scaling is the worst of both worlds. You wait too long to scale up because something’s happening to the system, performance is bad, and then you scale up.
Then you’re afraid to scale down, because a second ago people were complaining, so you’re going to wait too long to scale down. It’s really the worst. Autoscaling is following that curve automatically. That’s what we want. This is the holy grail: some line that follows the load. When we’re scaling OpenSearch, we’re scaling hardware, so we have to think about the three elements that we’re scaling. We’re going to scale disk. We’re going to scale memory. We’re going to scale CPU cores. These are the three things we want to scale. The load splits off into these three. Read load doesn’t really affect the disk: you can have a lot of read load or less read load, and it doesn’t mean you’re going to add disk. We’re going to focus more on the write load and its effects on the cluster, because it affects all three of these components. If there are a lot of writes, we might need to add more disk, or we might need more CPU cores because the type of writing is a little more complex, or we need more memory.
Vertical and Horizontal Scaling
I have exactly one slide devoted to vertical scaling, because when I was going over the talk with other folks, they said, what about vertical scaling? Amazon, behind the scenes when they’re running your serverless platform, are going to vertically scale it first. If you have your own data center, that can be relatively easy to do. Amazon does this because they have the capability to vertically scale easily. If you’re using a cloud, it’s harder. When you scale up, you usually go from one class of machines to the next class of machines, which means you have to stop that machine and move data. That’s not something that’s easily done normally. Vertical scaling, for most intents and purposes in most companies, is really just the disk. That is easy: you can grow your EBS volumes and increase the disk there. Horizontal scaling is the thing you need to know how to do.
If you’re managing and maintaining clusters, you have to know how to do this. In OpenSearch, you just add a node, it gets discovered by the cluster, and it’s there. Practically, it’s easy. There’s a need to do this because of the changing load. There’s an expectation, however, that when you add a node, the load will just distribute. That is the case in a lot of different projects. Here, similar to the example with the museum, it’s not the case. You have to learn how the load is spread out, and you have to change how you’re maintaining the data as you are adding nodes.
If the load is disproportionately hitting one of the nodes, we call that a hotspot. Are any of you familiar with what hotspots are? You get a lot of load on one of those nodes, and then writes start to lag. Hotspots are a thing we want to avoid. Which moves us to the question of how we actually distribute this data so it goes to all these nodes evenly and we’re not getting these hotspots. When we index data into OpenSearch, each document gets its own ID. That ID is hashed, and then we take it modulo N, N being the number of shards. In this example, the hash is something that ends with 45, and 45 mod 4, because we have 4 shards, equals 1, so the document goes to shard number 1. If you have thousands of documents coming in, each with their own unique ID, they’re going to go to these different shards, and it more or less balances out. It works in reality.
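A toy sketch of that routing rule, with Python’s built-in hash() standing in for the murmur3 hash OpenSearch actually uses internally:

```python
# Routing rule: a document ID is hashed and taken modulo the number of primary shards.
def shard_for(doc_id: str, num_shards: int) -> int:
    return hash(doc_id) % num_shards

doc_id = "log-000045"
print(shard_for(doc_id, 4))  # routed to one shard when the index has 4 shards
print(shard_for(doc_id, 5))  # usually a different shard if N changes to 5,
                             # which is why N is fixed for the life of an index
```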
If we wanted the capability to just add a shard, to make the index just slightly bigger, why can’t we do that? The reason is this hash mod N. Say our document is stored in shard number 1, and we want to scale up, so we extend the index by one shard.
The next time we want to search for that ID, we do hash mod N to see where it is. But N just changed: it’s 5 and not 4. We’re now looking for the document in a different shard, and as far as the cluster is concerned, it’s gone. That’s why we have a fixed number of shards in our indices. We actually can’t change that. When you’re scaling OpenSearch, you have to know this. You can’t just add shards to the index. You have to do something we call a rollover: you take the index that you’re writing to, add a new index with the same aliases, and atomically start writing to the new index. This new index can have more shards. That’s the only way to really increase throughput.
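Here is a minimal sketch of triggering such a rollover through the REST API, using Python’s requests library; the alias name and shard count are hypothetical:

```python
import requests

BASE = "http://localhost:9200"

requests.post(
    f"{BASE}/logs-write/_rollover",
    json={
        # The new index behind the alias gets more primary shards,
        # which is the only way to raise write throughput for new data.
        "settings": {"index.number_of_shards": 8}
    },
)
```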
Another thing that’s frustrating when you’re trying to horizontally scale a cluster is that there are shared resources. Each of our data nodes is getting hit with all these requests to pull data out and at the same time to put data in. If you have a really heavy query with a leading wildcard, RegEx, something like that, hitting one or two of your nodes, the write throughput is going to be impacted. You’re going to start lagging in your writes. Why is this important to note? Because with autoscaling, often we look at the CPU and say, CPU is high on the nodes. That could be because of either of these two pressures. It could be the write pressure or the read. If it’s the read, it could be momentary, and you just wasted money by adding a lot of nodes.
On the one hand, we shouldn’t look only at the CPU, and we might want to look at the write load and the read load. On the other hand, write load and read load could be fine, but you have so many shards packed into each one of your nodes, because you’ve been doing all these rollover tasks, that you run out of memory. I’m just trying to give you a feeling for why it’s actually very hard to say, this metric says scale up.
Horizontally Scaling OpenSearch
The good news is, it’s sometimes really simple. It does come at a cost, similarly to eating cake, but it is still simple. If the load is high on one of those three resources, disk, memory, or CPU, we can add extra nodes and it will balance out, especially if it’s disk. Similarly, if the load is low on all three, and it has to be all three, not just one, so low memory, low CPU, low disk, we can remove nodes. That’s when it is simple: when you can clearly see the picture is over-provisioned or under-provisioned. I want to devote the rest of the talk to when it’s actually complicated, because the easy part is really easy. Let’s assume that we’re looking at one of those spikes, where the load goes up and down. Let’s say that when we see a lot of writes coming in, we want to roll over, and when they go down, we want to roll over again because we don’t want to waste money. The red is going to say that the writes are too high. We’re going to roll over.
Then we add this extra node, and everything is ok. Then the writes start to go down, and we’re wasting money at this point: there’s 20% load on each of these nodes. If we remove a node, we get a hotspot, because now we’ve created a situation where 40% is hitting one node, a disproportionate amount of pressure on one. That’s bad. What do we do? Do another rollover task, and now it’s 25% to each node. We could do this again and again on each of these spikes. If it’s a day-night load, you’d soon have way too many shards in your cluster, and you’d start hitting out-of-memory errors. Getting rid of those extra shards is practically hard. You have to find a way to either do it slowly by changing to size-based retention, or you can merge indexes, which is possible but very slow and takes a lot of CPU.
Cluster Topologies for Scaling
There is a rather simple way to overcome this problem, and that is to overshard. Rather than have shards spread out one per node, I could have three shards per node. Then, when I want to scale, I’ll add nodes and let those shards spread out. The shards are going to take up as much compute power as they can from those new nodes, so it’s like hulking out. That’s the concept. However, finding the sweet spot between oversharding and undersharding becomes hard. It’s difficult to calculate, and in many cases you’d want to roll over again into an even bigger index. I’m going to suggest a few topologies for scaling in a way that lets us maintain this sweet spot between way too many and way too few shards. The first kind is what I’d call a burst index.
As I mentioned earlier, each index has a write alias. That’s where you’re going to atomically be writing. You can change this alias, and it’ll switch over to whatever index you point to. It’s an important concept to be familiar with if you’re managing a cluster. What we’re suggesting is to have these burst indices prepared in case you want to scale out. They can be maintained for weeks, and they will be that place where you direct traffic when you need to have a lot of it directed there. That’s what we would do. We just change the write alias to the write data stream. That would look something like this. There’s an arbitrary tag, a label we can give nodes called box_type. You could tell an index to allocate its shards on a specific box_type or a few different box_types. The concept is you have burst type, the box_type: burst, and you have box_type: low.
As long as you have low throughput on your writes, and again, write throughput is probably the best indicator that you need more nodes, we don’t need the extra nodes. The low-throughput index is allocated to the nodes that have the low box_type. If throughout the day the throughput is not so low and we anticipate that we’re going to have a spike, and again, this is so tailored to your use case that I can’t tell you exactly what that looks like, but in many cases it’s that the write throughput is growing on a trend, then what you do is add these extra nodes. You don’t need to add nodes that are as expensive as the other ones. Why? Because you don’t intend to use that much disk space on them. They’re temporary, so you can give these new box_types a really small and efficient disk. You create the new nodes. The allocation of our burst index says it can be on either low or burst or both. All you have to do is tell that index that it’s allowed a total of one shard per node.
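In practice, both pieces, the box_type allocation filter and the one-shard-per-node cap, are ordinary index settings. A minimal sketch with Python’s requests library, assuming an unsecured cluster at localhost:9200 and hypothetical index names:

```python
# "box_type" is just a custom attribute each node declares in opensearch.yml, e.g.:
#   node.attr.box_type: low    # on the always-on nodes
#   node.attr.box_type: burst  # on the cheap temporary nodes
import requests

BASE = "http://localhost:9200"

# Keep the low-throughput index pinned to the "low" nodes.
requests.put(
    f"{BASE}/logs-low-000001/_settings",
    json={"index.routing.allocation.include.box_type": "low"},
)

# Let the burst index allocate on either group, and cap it at one shard
# per node so it spreads out as soon as burst nodes join the cluster.
requests.put(
    f"{BASE}/logs-burst-000001/_settings",
    json={
        "index.routing.allocation.include.box_type": "low,burst",
        "index.routing.allocation.total_shards_per_node": 1,
    },
)
```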
Then it will automatically spread out to all of these nodes. At this point, you’re prepared for the higher throughput, and you switch the write alias to the high-throughput index. This is the burst index type. As the load goes down, you can move the shards back by doing something called excluding the nodes. You just exclude those nodes, and the shards fly off of them to other nodes. Then you remove them. This is the first form of autoscaling. It works when you don’t have many tenants on your cluster. If you have one big index, and it may grow or shrink, this makes sense.
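Here is a sketch of the two operations just described, switching the write alias and then excluding the burst nodes so their shards drain off, again with hypothetical index, alias, and node names:

```python
import requests

BASE = "http://localhost:9200"

# Atomically move the write alias from the low index to the burst index.
requests.post(
    f"{BASE}/_aliases",
    json={
        "actions": [
            {"add": {"index": "logs-low-000001", "alias": "logs-write",
                     "is_write_index": False}},
            {"add": {"index": "logs-burst-000001", "alias": "logs-write",
                     "is_write_index": True}},
        ]
    },
)

# When the spike is over: exclude the burst nodes by name so their shards
# relocate to the remaining nodes, then the nodes can be terminated.
requests.put(
    f"{BASE}/_cluster/settings",
    json={"transient": {
        "cluster.routing.allocation.exclude._name": "burst-node-1,burst-node-2"
    }},
)
```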
However, in some cases, we have many tenants, and they’re doing many different things all at the same time. Some throughputs spike, when others will go down. You don’t want to be in a situation where you’re having your cluster tailored just for the highest throughput tenant. Because then, again, you are wasting resources.
Which brings me to the second and last topology that I want to discuss here, which is called the burst cluster. It is very similar to the previous one, but the difference is big. We’re not just changing which index we write to within the cluster, we’re redirecting traffic to a completely different cluster. We wouldn’t be using the write alias, we would be diverting traffic. It would look something like this. If each of these circles is a cluster, and each of them has that many nodes, why would we have a 10, and a 5, and a 60? The reason is that we want to avoid hotspots. You should fine-tune your clusters initially for your average load. The average load for a low-throughput tenant might be 5 nodes, so you want only 5 shards. For a higher throughput, you want a 10-node cluster, so you have 10 shards each. If you’re suffering from hotspots, the way to fix that is to spread the shards perfectly across the cluster. That means zero hotspots.
In this situation, we’ve tailored our system so that on these green clusters, the smaller circles, they’re fine-tuned for the exact amount of writes that we’re getting. Then one of our tenants spikes while the others don’t. We move only that tenant to send all their data, we divert it to the 60-node cluster, capable of handling very high throughputs, but not capable of handling a lot of disk space. It’s not as expensive as six times these 10-node clusters. It is still more expensive. Data is being diverted to a totally different environment. We use something called cross-cluster search in order to search on both. From the perspective of the person running the search, nothing has changed at any point. It’s completely transparent for them.
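Here is a sketch of how the burst cluster is wired in for search: registering it as a remote cluster and querying with the cluster:index syntax of cross-cluster search. The cluster alias, seed address, and index names are hypothetical:

```python
import requests

BASE = "http://localhost:9200"  # the regular tenant cluster

# Register the burst cluster as a remote cluster (seed = its transport address).
requests.put(
    f"{BASE}/_cluster/settings",
    json={"persistent": {
        "cluster.remote.burst.seeds": ["burst-node-1.internal:9300"]
    }},
)

# One search that spans the local index and the same tenant's index on the
# burst cluster, using the <remote>:<index> syntax.
resp = requests.post(
    f"{BASE}/tenant42-logs,burst:tenant42-logs/_search",
    json={"query": {"match": {"message": "timeout"}}},
)
print(resp.json()["hits"]["total"])
```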
In terms of the throughput, nothing has changed. They’re sending much more data, but they don’t get any lag, whoever is sending it. All the other tenants don’t feel it. There are many more tenants on this 10-node cluster, and they’re just living their best life over there. You could also have a few tenants sending to this 60-node cluster. You just have to manage how much disk you’re expecting to fill at the time of the burst. A way to make this a little more economical is to have one of your highest-throughput tenants always on the 60-node cluster. You then maintain a reason to keep it up even when there are no high-throughput tenants bursting from these other clusters. This is a way to think of autoscaling that is a bit outside of the box and not just adding nodes to a single cluster. It is also very useful if you are running a feature that is not widely used in OpenSearch yet, but is up and coming, called searchable snapshots.
If you’re running searchable snapshots, all your data is going to be on S3, and you’re only going to have metadata on your cluster. The more nodes you have that are searching S3, the better. They can be small nodes with very small disk, and they could be searching many terabytes on S3. If you have one node with a lot of disk trying to do that, the throughput is going to be too low and your search is going to be too slow. If you want to utilize these kinds of features where your data is remote, you have to have many nodes. That’s another reason to have such a cluster just up and running all the time. You could use it to search audit data that spans back many years. Of course, we don’t want to keep it there forever.
A way to do that is just snapshot it to S3. Snapshots in OpenSearch are a really powerful tool. They’re not the same as they are in other databases. It takes the index as it is. It doesn’t do any additional compression, but it stores it in a very special way, so it’s easy to extract it and restore a cluster in case of a disaster. We would move the data to S3 and then restore it back into these original clusters that we had previously been running our tenants on. Then we could do a merge task. Down the line, when the load is low, we could merge that data into smaller indexes if we like. Another thing that happens usually in these kinds of situations is that you have retention. Once the retention is gone, just delete the data, which is great. Especially if you’re in Europe, you have to delete it right on time. This is the burst cluster topology.
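Here is a sketch of that snapshot-and-restore flow against an S3 repository (the s3 repository type requires the repository-s3 plugin); repository, bucket, snapshot, and index names are hypothetical:

```python
import requests

BASE = "http://localhost:9200"

# Register an S3-backed snapshot repository.
requests.put(
    f"{BASE}/_snapshot/burst-archive",
    json={"type": "s3", "settings": {"bucket": "my-opensearch-snapshots"}},
)

# Snapshot one tenant's burst-cluster indexes.
requests.put(
    f"{BASE}/_snapshot/burst-archive/tenant42-2024-11-18",
    json={"indices": "tenant42-logs-*", "include_global_state": False},
)

# Later, restore into the regular cluster (run against that cluster's endpoint).
requests.post(
    f"{BASE}/_snapshot/burst-archive/tenant42-2024-11-18/_restore",
    json={"indices": "tenant42-logs-*"},
)
```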
Summary
There are three different resources that we want to be scaling. You have to be mindful, when you’re maintaining your cluster, of which one is causing the pressure. If you have very long retention, then it’s disk space. You have to start considering things like searchable snapshots, or maybe maintaining a cross-cluster search setup where data just accumulates on a separate cluster with very large disks, while your write load is on a smaller cluster. That’s one possibility. If it’s memory or CPU, then you would definitely have to add stronger machines. We have to think about these things ahead of time. Some of them are a one-way door.
If you’re using AWS and you add disk space, in some cases you may find it difficult to reduce the disk space again. This is a common problem. The main reason is that when you want to remove a node, you have to shift its data to the other nodes, and in certain cases, especially after you’ve added a lot of disk, that can take a lot of time. Some of these decisions are a one-way door. Many of them require a restart of a node, which is potential downtime. We talked about these two topologies, I’ll remind you, the burst index and the burst cluster, which are very important to think about as completely different options. I’d like to highlight that the first option I gave, the hulking out, the oversharding proposition, is also viable for many use cases.
If you have a really easy trend that you can follow, your data is just going up and down, and it’s the same every day: at noon people are sending 2x, at midnight it goes down to half of that, and it keeps going up and down. By all means, have a cluster that has 10 nodes with 20 shards on it. When you hit that afternoon peak, just scale out and let the shards spread out. Then when it gets to the evening, scale down again. If that’s your use case, you shouldn’t be implementing things that are this complex. You should definitely use the concept of oversharding, which is well-known.
Upcoming Key Features
I’d like to talk about some upcoming key features, which is different from when I started writing this presentation, because these things changed. One of the really neat things about the OpenSearch Software Foundation, which supports OpenSearch, is that the project has gone from being very AWS-centric to much more widespread. There are a lot of people from Uber, Slack, Intel, and Airbnb, developers starting to take an interest and developing things within the ecosystem. They’re changing it in ways that will benefit their business.
If that business is as big as Uber, then the changes are profound. One of the changes that really affects autoscaling is read/write separation. That’s going to come in the next few versions. I think it’s nearly beta, and a lot of the code is already there. I think it was August when I took this screenshot, and it was at 5 out of 11 tasks; they’re pretty much there by now. This will allow you to have nodes that are tailored for writes and nodes that are tailored for reads. Then you’re scaling the writes and the reads separately, which makes life so much simpler.
The other one, which is really cool, is streaming ingestion. One of the things that really makes it difficult to ingest a lot of data all at once is that today, in both Elasticsearch and OpenSearch, we’re pushing it in. The indexing process is trying to push the data in and ingest it. The node might be overloaded, in which case the shard will just say, I’m sorry, CPU is up to here, and you get what is called a write queue. Once that write queue starts to build, someone’s going to be woken up, normally. If you’re running observability data, that’s a wake-up call. With pull-based ingestion, the shard is tied to a particular partition that it listens to and retrieves documents from, in, for example, Kafka. It would be pluggable, so it’s not only Kafka.
Since it’s very common, let’s use Kafka as an example. Each shard will read from a particular partition. A topic would represent a tenant. You could have a shard reading from partitions of different topics, but per topic it would be one-to-one, so shard 0 reads from partition 0. What this gives us is the capability for the shard to read as fast as it can, which means you don’t get a write queue, because it’s reading just as fast as it possibly can, based on the machine, wherever you put it. If you want to scale, in this case, it’s easy: you look at the lag in Kafka. You don’t have to look at cluster metrics, let’s look at CPU, let’s look at memory, let’s see if the shards are balanced, which is much harder to do. The metric here is much easier: is there lag in Kafka? Yes, then I need to scale. In this case, it will make life much easier.
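This is not an OpenSearch API yet, but conceptually the scaling signal would look something like the consumer-lag check below, sketched here with the kafka-python package and hypothetical topic, group, and broker names:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    group_id="opensearch-ingest",   # the group the shards would consume as
    enable_auto_commit=False,
)

partitions = [TopicPartition("tenant42-logs", p) for p in range(4)]
end_offsets = consumer.end_offsets(partitions)

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0
    total_lag += end_offsets[tp] - committed

# If lag keeps growing, that is the cue to add nodes or roll over;
# if it stays near zero, the cluster is keeping up.
print("total lag:", total_lag)
```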
Questions and Answers
Participant 1: I had a question about streaming ingestion. Beyond just looking at a metric, at the lag in Kafka, does that expose a way to know precisely up to which point in the stream this got in the document? We use OpenSearch in a bunch of places where we need to know exactly what’s available in the index so that we can propagate things to other systems.
Stern: It is an RFC, a request for comments.
Participant 1: There’s not a feature yet.
Stern: Right now, it’s in the phase of what we call a feature branch, where it’s being implemented in a way that it’s totally breakable. We’re not going to put that in production. If you have any comments like that, please do comment in the GitHub. That would be welcome. It’s in the exact phase where we need those comments.
Participant 2: This is time-series data. Do you also roll your indexes over by time, like quarterly or monthly? How do you combine this burst index approach with a situation where you have indexes along the time axis?
Stern: If it’s retention-based? One of the things to keep in mind is that you have the burst index, and you don’t want it to be there for too long. Do you want the burst index to live longer than the retention?
Participant 2: It’s not just the burst indexes, your normal indexes are separated by time.
Stern: In some cases, if your indexes are time-based and you’re rolling over every day, then you’re going to have a problem of too many shards if you don’t fill them up enough. You’ll have, basically, shards that have 2 megabytes inside them. It just inflates too much. If you have 365 days or two years of data, that becomes too many shards. I do recommend moving to size-based rollover, or a hybrid solution: size-based, as long as the index is less than x days old, so you’re not exactly on the date boundary, but it’s better. Having said that, the idea is that you have your write alias pointed at the head. Then after a certain amount of time, you do a rollover task. The burst index, you don’t necessarily roll over. Instead of rolling over, you merge it, or you re-index that data into the other one. You can do that; it just takes a lot of time, and you can do it in the background. There’s nitty-gritty here that we didn’t go into.
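A sketch of that hybrid rollover condition, size-based with an age cap, expressed as a manual rollover call; the alias and thresholds are hypothetical, and in practice this usually lives in an ISM (Index State Management) policy rather than a one-off request:

```python
import requests

BASE = "http://localhost:9200"

# Roll over when the index reaches a target size or a maximum age,
# whichever comes first.
requests.post(
    f"{BASE}/tenant42-logs-write/_rollover",
    json={"conditions": {"max_size": "50gb", "max_age": "7d"}},
)
```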
Participant 3: I think you mentioned separation of reading and writing. It’s already supported in OpenSearch Serverless in AWS. Am I missing something? The one that you are talking about, is it going to come for the regular OpenSearch, and it’s not yet implemented?
Stern: I don’t work at AWS. I’m representing open source. Both of these are going to be completely in open source.
Participant 3: That’s what I’m saying. It seems like it’s already there in version, maybe 2.13, 2.14, something like that. You mentioned it is a version that is coming, but I have practically observed that it’s already there, in Amazon serverless.
Stern: Amazon serverless is a fork of OpenSearch. It took a good number of engineers more than a year to separate these two things, given that OpenSearch is a monolithic application, and get to read/write separation. A lot of these improvements, they’re working on upstream. They like to add these special capabilities, like read/write separation, and then they contribute a lot of that back into the open source. In some cases, you’ll have features already available in the Amazon OpenSearch offering, and later they get introduced into the OpenSearch open source.
Participant 3: The strategies that you explained just now, the ones that are coming, especially the second one, the one with the Kafka thing, is there a plan?
Stern: Again, this is very early stage, the pull-based indexing. That one is at a stage where we presented the API that we imagine would be useful for this. We developed the concept of it’ll be pluggable, like which streaming service you’d use. It’s at a request for comments stage. I presented it because I am happy to present these things and ask for comments. If you have anything that’s relevant, just go on GitHub and say, we’re using it for this, and one-to-one doesn’t make sense to us. If that’s the case, then yes.
Participant 3: It can take about six months to a year?
Stern: That particular one, we’re trying to get in under a year. I don’t think it’s possible in six months. That’s a stretch.
Participant 4: I think this question pertains to both the burst index and the burst cluster solution. I think I understand how this helps for writing new documents. If you have an update or a delete operation, where you’re searching across your old index, or your normal index, and then either the burst index or the burst cluster, and that update or that delete is reflected in the burst cluster, how does that get rectified between those two?
Stern: One of the things you have to do if you’re maintaining these types of indexes, like a burst index, is to have a prefix that signifies the tenant, so that for any action you do, like a deletion, you’d say, delete based on these aliases. You have the capability of specifying the prefix with a star at the end, like a wildcard. You could also give indexes a read alias per day, and it’s very common to do this, especially with time-series data. You have an index, and it has aliases for different dates with the tenant ID attached to them. When you perform a search for that tenant ID plus November 18th, that index is then made available for the search. You can do the same thing when you’re doing operations like a get or a delete. You can say, these aliases, I want to delete them, or I want to delete documents from them. It can go either to the burst cluster, or it can go to indexes that have completely different names, as long as the alias points to the correct one.
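A sketch of that per-tenant, per-day alias idea: attach a read alias that encodes tenant and date, then target the alias or a tenant-prefix wildcard for searches and deletions. All names here are hypothetical:

```python
import requests

BASE = "http://localhost:9200"

# Attach a read alias that encodes tenant and date to the index holding that data.
requests.post(
    f"{BASE}/_aliases",
    json={"actions": [
        {"add": {"index": "logs-burst-000007", "alias": "tenant42-2024-11-18"}}
    ]},
)

# Searches (or delete-by-query for clean-up) can then target the alias, or a
# wildcard over the tenant prefix, without knowing which physical index the
# data landed on.
requests.post(
    f"{BASE}/tenant42-*/_delete_by_query",
    json={"query": {"range": {"@timestamp": {"lt": "now-30d"}}}},
)
```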
The burst cluster means you really have to manage it. You have to have some place where you’re recording, this tenant has data here and there, the date that I moved the tenant over there, and the date that I moved them back. It’s very important to keep track of those things. I wouldn’t do it within OpenSearch. A common mistake when you’re managing OpenSearch is to say, I have OpenSearch, so I’m going to just store lots of information in it, not just the data that I built the cluster for. It should be a cluster for a thing and not for other things. Audit data should be separated from your observability data. You don’t want to put them in the same place.
Participant 5: A question regarding the burst clusters, as well as the burst nodes that you have. With clusters, how do you redirect the read load? Is the assumption that we do cross-cluster search? With OpenSearch Dashboards in particular, when you have all your alerts and all that, and with observability data, you’re querying a particular set of indexes, so when you move the data around clusters, how do you manage the search?
Stern: For alerting, it is very difficult to do this if you’re managing alerting using just the index. If you use a prefix, it could work. If you’re doing cross-cluster search, the way that that feature works is that, in the cluster settings, you provide the clusters that it can also search on. Then when you run a search, if you’re doing it through Amazon service, it should be seamless. If you’re running it on your own, you do have to specify, instead of just search this index, it doesn’t know that it has to go to the other cluster. You have to say, within this cluster, and that cluster, and the other cluster, search for this index. You have to add these extra indexes to your search.
Participant 5: There is a colon mechanism where you put in the cluster name. Basically, what you’re expecting is that, in addition to the write side, we have to keep the read side in mind before spinning up a burst cluster.
Stern: You have to keep track where your data is when you’re moving it.
Participant 5: The second part of the question, with burst nodes, is, I’m assuming you’re amortizing the cost of rebalancing. Because whenever a node goes up or down, shards are moving around, and that consumes cluster capacity, CPU, network, storage, with all these transport actions happening. You’re assuming, as part of your capacity planning, you have to amortize that cost as well.
Stern: Yes. Moving a shard while it’s being written to, when it already has 100 gigs on it, is a task that is just going to take time, and you need high throughput now. It’s amortized, but it’s very common to do a rollover task with more shards when your throughput is big. It’s the same: you’d be doing this anyway. You’d anyway be rolling over to an index that has more shards and more capability of writing on more nodes. So it’s sort of amortized.
Participant 5: With the rollover, you’re not moving the data, though. It’s new shards getting created.
Stern: Yes. We don’t want to move data when we’re doing the spread-out. That really slows things down.