
Powering Enterprise AI Applications with Data and Open Source Software

News Room | Published 15 December 2025

Transcript

Francisco Javier Arceo: I am Francisco. I am here to talk to you today about production artificial intelligence. I'll talk a little bit more about what that is. I am a Senior Principal Software Engineer at Red Hat. I am on the Kubeflow steering committee. I'm a maintainer for Feast, an open-source feature store. I'm going to talk a little bit about some of the challenges that come up with production AI. I spent about 12-plus years building AI/ML products for banks and FinTech in places like Goldman Sachs, Commonwealth Bank of Australia, AIG, Affirm, Fast, and my own startup at some point. I joined Red Hat about a year ago to work on open-source AI. I get to work on a mixture of things: distributed training, pipelines (predominantly Kubeflow Pipelines), the Feast feature store, RAG, and agents. In my ample free time, I like to write some code.

Context

I wanted to give everybody some context about goals for today. There are three things that I want you to be able to take away from this conversation. One, I want you to understand the value of proprietary data for AI. Data is really the only value add in AI. When Meta released the weights for their Llama models, they basically told the entire world that the only valuable thing here is the data. I think people often forget that conclusion. Really, the novelty for any enterprise or even any startup is what you can do with the data that you have. Whether that's in training or in serving, it really comes down to the data. That leads to the next thing I want you to take away, which is understanding the complexity of data. It turns out, for a bunch of reasons that we'll talk about, working with data for AI products is just really hard. I want you to understand how some powerful open-source frameworks can really help you manage this complexity. Those are the goals for today.

Production Artificial Intelligence

What is production artificial intelligence? I call it the trinity: inference, data, and product. Inference: we're all pretty familiar with inference these days. A model that generates predictions. Sometimes it's tokens, generating tokens. That's how ChatGPT works. Data: the inputs that go into the model to make the inference useful. Sometimes you'll hear about RAG, Retrieval-Augmented Generation. That's using in-context learning to generate some interesting responses. Sometimes it's a chatbot. Sometimes that's like an extraction problem or other sorts of things.

Ultimately, that’s wrapped up in a product. A product is really just whatever experience you have that uses those two things. Chip Huyen, on the right, wrote the AI Engineering book. I highly recommend it to anybody who’s interested in the space. It’s an incredible book. She goes into a lot of depth about what is production AI and AI engineering, and how a lot of it is built off of the learnings we’ve had from ML engineering, which was the predecessor. People bifurcate these things into the old tabular world of predictive ML, and then generative AI. In many ways, everything’s the same, except now we don’t necessarily have to train a model. We have all of the same problems. I think that’s a really important foundation, because all the learnings that we’ve had from the decades prior, we can use.

Why is production AI hard? One, inference is hard, because large-scale LLMs and multi-modal models increased the amount of hardware we need. Then, time-to-first-token became a very important metric for getting stuff out there. The good news is, we figured it out. We just threw GPUs at the problem. NVIDIA, obviously, has dominated this space, and other players are emerging. The graph on the right is essentially the flops per second as a function of time. You see that we've gotten really good at doing matrix multiplications. That's really all these GPUs do. Then, data. Data is really hard, because serving data in production is hard, for lots of reasons, mostly non-technical, and then some technical. We'll get into that. Then, product. Product is super hard, because you can build a perfect product and nobody uses it. That's just a really hard business problem in general.

Why should we care? This is probably the most important thing I want to say: from the old world, there was this often-cited statistic that 87% of data science or AI projects fail. I've seen entire teams wiped out by this statistic, because they weren't able to deliver business value with their exploratory projects. Or they had a million proof of concepts, but nothing in production. When that result persists, eventually, businesses decide that this doesn't make sense to fund as a side project, because the cost is too high. It's really critically important, in my opinion, that projects get into production as soon as possible.

Then you iterate, and learn, and test small, and then expand. It turns out, and there's a really great article about this in MLOps history, that most of the work in production AI is plumbing. Whether it's setting up CI/CD pipelines or data pipelines, a lot of it ends up being much more of this plumbing work than what you'll see in the media, which is everyone training this huge model. It turns out that's actually a much more straightforward problem. Again, data is really the competitive advantage. There's no business impact for a company without the ability to provide customer value, and that comes from being able to leverage its data.

Then, great data will enable great product. I think that there’s this really great flywheel effect that comes when you can marry these things together, inference, data, and product, that you really start to get this persistent flywheel. You start to get more data and more AI problems as you create really great product experiences. I think Netflix is the canonical example here, where they started with this recommendation engine. Everybody loved it. Here are all these unseen movies I didn’t realize I’d love. Then they became this huge behemoth of a company, where they have AI everywhere.

Why Is Data Hard?

Going back to the point, why is data hard? It turns out production systems are just hard in general. We had several SRE talks because the reliability aspect is challenging in its own right. There are four key areas that I'd say make data hard, and there's probably one I've omitted. The first one is consistency. Training-serving skew is a real thing, a real phenomenon. I've worked at a bunch of different places, and training-serving skew has been a recurring phenomenon. It's a pretty simple concept, and I'll go into it in a little bit. Efficiency is another one. People just reinvent the wheel. That's a problem. Governance is super boring, but it's really important, as it turns out. Then, reliability, again, maintaining uptime for any service is just really hard. There's a joke here about databases that, I think, is actually useful to know in practice.

Why Is Consistency Hard?

Why is consistency hard? By consistency, we mean, again, training-serving skew. The idea is pretty simple. An ML engineer is going to build a model, and they're going to write some Python code, probably. To be more precise, they're going to use Python code to transform some data. Then in serving that model, they're going to want to transform that data again. Maybe your production system is written in Java or Scala or something else. You either reimplement this code, or you have some connection to try and call the Python function or something else.

Oftentimes, that first approach of rewriting the code ends up quite brittle. It also doesn't scale really well. You end up having a lot of production issues where, as an example, someone forgot to handle a divide-by-zero for some feature. This training-serving skew, it turns out, ends up being very consequential in practice. You can imagine that for a financial institution that's trying to give out loans, if they got the denomination of the dollar amount wrong, let's suppose they had a feature encoded in dollars instead of cents, it turns out that's a really consequential problem. That's actually an example from two companies I worked for before. That ends up being a really hard problem. Again, using different languages in production versus development, or model training as they call it, ends up being a very costly challenge, in two ways, really.

One, it's the reimplementation of the logic. Two, it's the velocity. What you end up finding is that there are a lot of business problems where you can have a lot of impact by serving more data in production, but you're bottlenecked by the one or two people who can actually translate this code from Python into Java or whatever. Having this dual implementation problem ends up being a big bottleneck. Then there's this third point. This one's a little more nuanced. When developing features for training data, it turns out you can leak data from the future into your model. It's a weird, nuanced issue, but it actually ends up happening a lot. This ends up with model developers getting really excited about their model, saying, I'm going to have like $50 million in business impact once we launch this bad boy into production.

Then they go and find out that they leaked data from the future, and it turns out their model's really terrible once they remove it. That's happened to me. I try not to let it happen anymore. It's a real thing. There are tools that we can use to handle these things. Suffice it to say, consistency is hard, and not in the distributed system sense.
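To make the dual-implementation problem concrete, here is a minimal sketch (with hypothetical names, not code from the talk) of the usual fix: define the transformation once and import it from both the training pipeline and the serving path, so the dollars-versus-cents bug can only exist in one place.

```python
# skew_example.py -- a minimal sketch of avoiding training-serving skew.
# Hypothetical feature: normalize a loan balance to dollars and cap outliers.
# Both training and serving import this ONE function, so there is no second
# implementation to drift (e.g., dollars vs. cents).

def loan_balance_dollars(balance_cents: int, cap_dollars: float = 1_000_000.0) -> float:
    """Convert a balance stored in cents to dollars and cap extreme values."""
    dollars = balance_cents / 100.0  # the dollars-vs-cents bug lives or dies here, once
    return min(dollars, cap_dollars)


# --- training pipeline (offline) ---
def build_training_rows(raw_rows):
    return [
        {"balance_usd": loan_balance_dollars(r["balance_cents"]), "label": r["defaulted"]}
        for r in raw_rows
    ]


# --- serving path (online) ---
def build_serving_features(request: dict) -> dict:
    return {"balance_usd": loan_balance_dollars(request["balance_cents"])}


if __name__ == "__main__":
    print(build_training_rows([{"balance_cents": 123_456, "defaulted": 0}]))
    print(build_serving_features({"balance_cents": 123_456}))
```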

Why Is Efficiency Hard?

Why is efficiency hard? It turns out people like to reinvent the wheel. I think the reason might be, I didn't code it, so I just want to code it myself. Or, I don't want to look at somebody else's code. That tends to be a common thing. The other is less cynical and just more practical, which is, historically, people didn't really have a centralized store of data. As an example, when I was working at the Commonwealth Bank of Australia, I flew to Sydney twice to find data from somebody just to figure out, how do I calculate the exposure of a loan for a customer, that is, all of their outstanding loan balances? That seems like a pretty intuitive thing.

If you’re going to give someone a loan, you want to know how much they owe you right now. Again, I had to fly to Sydney, Australia, just to find out which table that was in, because there wasn’t a real central catalog of this information. People would have a spreadsheet that they would manually update at least 8 years ago. That doesn’t scale super well. It gets stale very quickly. Having robust documentation is a really important thing for production use cases.

That feeds into the next one, which is, you make something, you have to make it discoverable. Discoverability is a really important aspect to that. Discoverability in its own right is hard. How do you surface that content, if you have the content, to people? Then, the most important one is complex data integrations. This is a software problem. This is a purely technical problem where you have approximately three different ways to ingest data into a central sink. Having a standardized way for every microservice you have to push that data into some central store is a really hard problem to get everybody on the same page about. The joke on the right is that you end up with people re-implementing the same thing a million times. It ends up being very costly and confusing, and really just a waste of time.

Why Is Governance Hard?

Why is governance hard? The joke is that data governance is not a priority. I say that flippantly. Oftentimes, I've found product and business priorities compete with the idea: we care about data governance and privacy as a bullet point, but then when it comes to prioritizing in a sprint or whatever, that gets lost somewhere in translation. There are tools that help us with this. I think at any given business, it's really hard to prioritize data governance unless you're leveraging something that someone else has built, like some open-source frameworks. Again, that's just a prioritization issue. I think ownership of data also tends to be very contentious.

It goes in both directions: in my experience, some people in some companies want too much control and don't want you to have access to any of their data, or the reverse, where nobody wants to control the data and nobody is an owner. You actually need to have ownership, but in a good partnership with everybody who's going to be consuming all of that data. Data silos. Again, this is the issue I had where I flew to Australia, and it feeds into discoverability: teams are sometimes, for good reasons, blocked off from being able to access other teams' data. It creates these silos, almost on purpose. It still ends up limiting your ability to be successful in using this data for production use cases.

Then, regulatory concerns. This is a legitimate problem. There are ways to govern access to data without having to create too many artificial constraints. It is a reality that when you’re working in certain industries, you do have to bear in mind the complex regulations and different auditors wanting to be able to see exactly who accessed what. There are approaches to manage that risk as well. All of those things are ultimately why governance is hard.

Why Is Reliability Hard?

Why is reliability hard? High availability of any production system is just really hard. People are changing software all the time. For AI/ML, people are adding new features, people are adding new models constantly. Sometimes it's brittle, sometimes it's not, depending on who wrote the code. The reality is that there's just a lot of velocity. The more velocity in a system, the more likely it is to break. Maintaining low latency is really hard, especially as your database scales. It's an entire subfield of computer science. I just want to emphasize that achieving low latency at scale can be very challenging, for a million reasons. When it's easy, it's easy. It takes a lot of work to get it there. Scaling for traffic shocks. Obviously, Kubernetes is a powerful tool and a powerful platform for you to be able to scale your traffic horizontally.

Sometimes you still have to front-run things and prepare for peak traffic. Things can go wrong even during peak traffic, and that can have some really big consequences for business-critical use cases. Again, there are ways to handle that. Then, fault tolerance. ML systems in particular are a hybrid of really rigorous, well-known math, the models you're running inference with, and a huge amount of business logic around data, which ends up resulting in very brittle systems. It can be very challenging. There are straightforward ways to handle fault tolerance.

Feast: The Open-Source Feature Store

What can we do about it? Cue Feast, the open-source feature store. Today's talk is a little bit about Feast. Use a feature store. Feast, or another feature store (there are others), can help with this. I maintain Feast, so I can tell you how we handle these things. I think it's really important. I'm happy to go briefly into the history of Feast. Feast was originally created in collaboration with Google and Gojek, and it was shepherded by Tecton at some point. They're one of the big proprietary feature stores. Then, I'm a maintainer. I started maintaining it at Affirm, where I had shipped it. Affirm is a checkout company.

Then, I joined Red Hat to work with Feast. I feel very privileged I get to work with the team there. What Feast is, is a tool that does all of these things. It lets you unify your data for serving and training. It creates a catalog for you to centralize, access, and govern your data. It has a centralized metadata registry. It has robust RBAC and permissions models to support enterprise needs. It’s battle tested to support large distributed computing and horizontal scaling needs. Again, my last company was a checkout company. We had partnerships with a lot of people that had a lot of traffic on Black Friday, Cyber Monday. We were able to really scale out well and be thoughtful about these things. It turns out it was a lot of work.
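As a rough illustration of that serving/training unification, here is a minimal sketch using Feast's Python SDK against the `driver_hourly_stats` feature view that `feast init` generates; the feature names come from that quickstart, and the exact API surface may vary by Feast version.

```python
from datetime import datetime
import pandas as pd
from feast import FeatureStore

# Point at a feature repository (e.g., one created with `feast init` and `feast apply`).
store = FeatureStore(repo_path=".")

# Training: point-in-time-correct historical features joined onto labeled events.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": [datetime(2025, 1, 1), datetime(2025, 1, 2)],
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"],
).to_df()

# Serving: the same feature definitions, read from the low-latency online store.
online = store.get_online_features(
    features=["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(training_df.head())
print(online)
```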

How does it work? Here's a very high-level diagram. Request sources: this is like your APIs. Maybe you have a service API where, for logical reasons, like wanting strong consistency in your data, you want fully synchronous writes and transformations. This is where a request source, essentially an API call, makes a lot of sense. We have stream sources. Event and streaming architectures tend to be pretty popular nowadays, especially for high-volume use cases. You can have that.

Then you could have essentially Flink or Spark streaming, transform your data and upsert at a reasonable cadence. You can support batch sources. Batch sources are exactly what you think. You process a billion records and transform them into a million or something. You just want to upsert them into a database using Spark, or Ray, or Daft. Feast can handle that pretty well. Then you would transform that data and you’d store it into both a registry, which is this little box here. Then there’s a database somewhere in between here. You would serve it online. We call it online features.

Then, offline. Offline is for model training. In the tabular world, people spend a lot of time building scorecard models or recommendation engines, whatever. Training is a pretty important part of that workflow. In the fine-tuning era, where you’re taking an LLM, you want to fine-tune it on some of your own proprietary data, the logic still applies. It’s the same. It’s important to understand. For online inference, even with RAG, it’s also still equivalent.
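Moving feature values from the offline store into the online store is what Feast calls materialization. A minimal sketch, assuming the same feature repository as above (the CLI equivalent is `feast materialize-incremental`):

```python
from datetime import datetime, timedelta
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Load the latest feature values from the offline store (e.g., Parquet, Snowflake)
# into the online store so they can be served at low latency.
store.materialize_incremental(end_date=datetime.utcnow())

# Or materialize an explicit window, e.g., backfilling the last 7 days:
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=7),
    end_date=datetime.utcnow(),
)
```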

Here's actually a deeper look at what an architecture looks like. Here, I call it a data producer. A data producer could be like a ledger. You have a ledger that's taking every payment from a customer, and every deposit from a customer. You could emit an event, let's say like, this customer just paid their balance or whatever. That event could be consumed by a Kafka topic. Flink could then transform that into a windowed feature, like the balance over the last seven days. That could then be written to this online database. That's how you'd serve it for an online customer experience. Let's say you wanted to actually train on that data.

At the same time, you could take that event that was consumed and fire it into S3. Then consume that into your offline store. Offline store could be a data warehouse, like Snowflake. Or it could just be Parquet data in S3, and you could use Spark to just query it. That’s where this training dataset preparation really matters quite a bit. You can do a lot of testing here. This is generally where ML engineers or data scientists spend most of their day with a Jupyter Notebook. This ends up being a lot of work. There’s a lot of business value that actually gets added here where they scope out like, this model is going to add, again, $40 million or something to the profit of the company. You can also do batch upserts, if you run a model in batch, which is a pretty common pattern. That’s how we used to do ML in the old days, quite a while ago. You could say for every user, let’s recommend a bunch of the movies that they want. You could run a nightly job to say, every 24 hours, we’re going to give them the top 20 movie recommendations we want to give them.

Then you could just upload those predictions into the online store. You could just retrieve them at runtime. You could say, give me the features or recommendations. There’s nothing live about that inference at all. Pre-computing ends up being very powerful, especially to achieve low latency. It turns out that if you really want low latency AI applications, then you actually want to avoid inference if you can. Streaming architectures and batch models end up being a really good choice.

Sometimes the data doesn’t change so much and so often that you really need every second to do these things. That’s how that works. This example is what things look like today where if you call to get inference at runtime, sometimes the inference provider could say, let me go get the data. Then it’ll call that. You could also flip it where you can call this feature store and then get inference. There are tradeoffs to both ways, which is important to know. Depending on whether you’re optimizing for data freshness or latency, you might choose differently.
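A rough sketch of the pre-computed batch pattern described above, assuming a hypothetical `user_recommendations` feature view already exists in the repository; the write call follows Feast's documented `write_to_online_store` API, but treat the names and types as illustrative:

```python
from datetime import datetime
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Nightly batch job: a model scores every user offline, and we upsert the
# pre-computed recommendations into the online store.
recs = pd.DataFrame(
    {
        "user_id": [42, 43],
        "top_movies": [["Dune", "Heat", "Arrival"], ["Clue", "Ran", "Akira"]],
        "event_timestamp": [datetime.utcnow()] * 2,
    }
)
store.write_to_online_store(feature_view_name="user_recommendations", df=recs)

# At request time there is no live inference at all -- just a key-value lookup.
resp = store.get_online_features(
    features=["user_recommendations:top_movies"],
    entity_rows=[{"user_id": 42}],
).to_dict()
print(resp["top_movies"])
```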

Demo (Code)

How does it work? I'm going to walk through a very shallow demo here. We have a few different primitives. One is, we call things entities. These are just primary keys. Like here, this is a RAG example, actually, where I'm going to have a chunk ID. What you do with RAG is you can take a big document of text and chunk it up, partition it into sentences as an example. Let's say you have 10 sentences, and you could say that each sentence has a chunk ID with it. That's what you're going to use when you query that data in real time. You can embed this kind of chunk and store it somewhere else. You would declare some metadata here. There's a field parameter here, which is just telling you the string type. Again, this is the chunk entity. This is the document entity. You can have multiple primary keys. It'll combine into a composite key in this table thing. This is the data source, and it's just a file. You can just upload a Parquet file. You can also pass through BigQuery connections or Snowflake connections and other sorts of things.

For simplicity, we're just starting with Parquet. This is a feature view. A feature view is basically just a table. It's an alias for a table. Here we specify the entity, which is the chunk in this case. The fields are a file name and then the raw markdown. For reasons I'll get into, I'll explain it during this demo. Then this field is called vector. Basically, in order to enable vector similarity search for retrieval-augmented generation, you add these two flags. Actually, you only have to add one, which is vector search true. Then the other one passes the distance metric, which is cosine similarity in this case. Then there are some other parameters. This is the data source, which points to the file from before.

Then this is a TTL parameter. You don't have to worry about that. That's basically it. Your MLOps engineer deploys it, and we have an operator, so you can deploy this on Kubernetes and maintain the lifecycle of the application. This is basically what the application developer or machine learning engineer that's going to build this is actually going to write. I think that's really important, because for a long time there was this gap where a data scientist couldn't get their model into production, didn't know how, and didn't have a common language they could speak with a software engineer to get it there. This is really it. This enables everybody to get RAG into production, which is, I think, pretty exciting.
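Here is a hedged reconstruction of roughly what those definitions might look like in Feast's Python SDK; the entity, field, and path names are illustrative, and the vector-search flags (`vector_index`, `vector_search_metric`) follow Feast's documented API at the time of writing, so double-check them against your version:

```python
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Array, Float32, String

# Primary keys: a chunk of a document, plus the document it came from.
chunk = Entity(name="chunk_id", join_keys=["chunk_id"])
document = Entity(name="document_id", join_keys=["document_id"])

# The data source is just a Parquet file of pre-chunked, pre-embedded text.
chunk_source = FileSource(
    path="data/document_chunks.parquet",  # illustrative path
    timestamp_field="event_timestamp",
)

# A feature view is essentially an alias for a table, keyed by the entities above.
document_chunks = FeatureView(
    name="document_chunks",
    entities=[chunk, document],
    ttl=timedelta(days=365),
    schema=[
        Field(name="file_name", dtype=String),
        Field(name="raw_markdown", dtype=String),
        # Enabling vector similarity search on the embedding column:
        Field(
            name="vector",
            dtype=Array(Float32),
            vector_index=True,              # turn on vector search for this field
            vector_search_metric="COSINE",  # distance metric used for retrieval
        ),
    ],
    source=chunk_source,
)
```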

This piece is actually what's going to do a transformation. Here, this is some additional metadata. In the previous example, there's no transformation: it just loads a file with pre-computed embeddings. You have some batch process that splits apart your documents, embeds them, and then stores them in a Parquet file. This just loads it. Then you can do vector similarity search. What if you wanted to use Feast itself to do that transformation? You want to make that transparent to your catalog, and you want to make other people aware of how to do this stuff. You could do this with what's called an on-demand feature view, which basically says you can transform this on demand. You make an API call, and it's going to transform it in the feature server. Here, this is using an open-source tool called Docling. It's a fun tool.

If you're familiar with Reducto, Docling is like an open-source counterpart to it. Docling is an open-source framework. It runs a bunch of small vision and NLP models, BERT-style models, to extract text from PDFs. Here, I'm sending bytes of a PDF file in this input request. It transforms the data on the fly, extracts each sentence chunk as markdown, and adds the embedding along with it, so that you can do vector similarity search. It returns it in a dictionary. What's really nice about this is that what you serve in real time in this API can be consistent with what you run in a batch engine like Spark.

If you had like a million PDFs to process, you'd have symmetry between what you run in real time in an application, where you want low latency for a customer experience, and what you run in batch to process a whole bunch. You don't have this training-serving skew problem I mentioned before. You don't have to re-implement any code. You can basically use this as a user-defined function, a UDF, in Spark. It has this decorator syntax and another chunk of code here; it's pretty much the same. That ends up being really powerful. You could also share that in streaming as well. Again, that unlocks a lot of efficiency for people.
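A rough sketch of what such an on-demand transformation could look like, assuming Feast's python-mode on-demand feature views, Docling's DocumentConverter, and a sentence-transformers embedding model; for brevity it returns one markdown string and one embedding per document rather than per-chunk, and the names are illustrative rather than the demo's actual code:

```python
import tempfile
from docling.document_converter import DocumentConverter
from feast import Field, RequestSource, on_demand_feature_view
from feast.types import Array, Bytes, Float32, String
from sentence_transformers import SentenceTransformer

# Request-time input: raw PDF bytes sent in the API call.
pdf_request = RequestSource(
    name="pdf_request",
    schema=[Field(name="pdf_bytes", dtype=Bytes)],
)

_converter = DocumentConverter()
_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings


@on_demand_feature_view(
    sources=[pdf_request],
    schema=[
        Field(name="document_markdown", dtype=String),
        Field(name="document_embedding", dtype=Array(Float32)),
    ],
    mode="python",
)
def pdf_features(inputs: dict) -> dict:
    """Convert PDF bytes to markdown and embed it, on demand, one row per request."""
    markdowns, embeddings = [], []
    for pdf in inputs["pdf_bytes"]:
        # Docling's converter takes a path/URL, so spill the bytes to a temp file.
        with tempfile.NamedTemporaryFile(suffix=".pdf") as f:
            f.write(pdf)
            f.flush()
            md = _converter.convert(f.name).document.export_to_markdown()
        markdowns.append(md)
        embeddings.append(_embedder.encode(md).tolist())
    return {"document_markdown": markdowns, "document_embedding": embeddings}
```

Because the transformation is just a decorated Python function, the same function body can also be reused as a UDF in a Spark or streaming job, which is the symmetry described above.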

Feast’s User Interface

I told you about the UI. One thing that's really cool is that we've been investing a lot in this recently. By we, I mean me. This is our feature lineage, where we have a nice way to actually discover your features and see, what kind of data do I have in there? You can filter by data sources, by entities, and all the other junk. Really, it's about making data more discoverable for people. You can search through metadata here. We're adding a lot more for data labeling and enabling label views, because data labeling is really important to complement the data that you send in inference. You also want to be able to label it so that you can then fine-tune it later on or train on it. You get a lot of really powerful metadata just from this. It's really about empowering the end users, model developers, and data scientists to not be bottlenecked in getting stuff into production.

Example – Retrieval-Augmented Generation

I wanted to go over some really important examples. Retrieval-augmented generation is the one I walked through with Docling. Most people are pretty familiar with it, but just in case you're not, a user will ask a question to a chatbot. Usually, you annotate that question as a query. That query goes through an embedding model in real time. The query is then represented as a vector, a sequence of numbers, 384 dimensions or something like that. Then that vector is passed into a system. In this case, the demo I had built was with Milvus and Docling. That vector similarity search will go against the Milvus database.

Then it'll essentially compute a dot product, rank order the top-K documents, or chunks actually, however you put it, and return those. It'll sort them by whichever has the lowest or highest score, depending on the metric, to say, give me the closest set of chunks. Then it'll return that back into the context.

Then give that back to the user. What's important behind the scenes here is that, again, you can process and embed and insert your data however you want, maybe using, again, Spark, or Ray, or all these various different offline batch transformation engines. What's really powerful about Feast is that we move beyond vector similarity search. There are things like hybrid search, keyword search, and then basic entity search. The way that we look at this in Feast is that this is just retrieval. There are different forms of retrieval. There's graph RAG as well, where you restructure things in some pretty sophisticated ways, but these are all mechanisms for just retrieving data.

We can express all of these things and make them retrievable within Feast with actually just a little bit of code for your end user. I think this is really important, especially for infrastructure teams. Because what I've found, again, is that I've seen entire infrastructure and data science teams be essentially shut down because they weren't able to get their projects into production, because they didn't have a common pattern or language. Feast offers that.
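As a sketch of what the query side can look like for an end user, assuming the `document_chunks` feature view from earlier; the `retrieve_online_documents` call follows Feast's documented (alpha) vector-retrieval API, and the exact method name, arguments, and returned fields may differ across versions:

```python
from feast import FeatureStore
from sentence_transformers import SentenceTransformer

store = FeatureStore(repo_path=".")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Embed the user's question with the same model used to embed the chunks.
query = "How do I dispute a charge on my statement?"
query_vector = embedder.encode(query).tolist()

# 2. Vector similarity search against the online store (Milvus in the demo),
#    returning the top-K closest chunks to stuff into the LLM's context.
results = store.retrieve_online_documents(
    feature="document_chunks:vector",
    query=query_vector,
    top_k=5,
)

# 3. Assemble the retrieved chunks into the prompt context (field names depend
#    on what your version of Feast returns alongside the vectors).
context = "\n\n".join(results.to_dict().get("raw_markdown", []))
print(context)
```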

Questions and Answers

Participant 1: I think the example you’ve listed here is for the retrieval-augmented generation, which I think is a pretty valid example. I think more commonly nowadays, there’s a bit more buzz around the Model Context Protocol. How does this fit in with that system?

Francisco Javier Arceo: We just added MCP support for Feast. Feast's feature server is a FastAPI server. It's pretty trivial to add MCP support to a FastAPI application. What that unlocks is for people to just treat it all like tokens. I have an NLP background. The Model Context Protocol treats everything as tokens. What we learned from the rich history of NLP and tabular modeling before this is that there are other, more expressive forms of data, especially in information retrieval. Sure, you can treat it all as context, but maybe that's not the only way.

In fact, in hybrid retrieval, in more specialized retrieval systems, you might want to weight different things differently. You might want to include metadata in the weighting mechanism. You even have rerankers, a pretty sophisticated area where you'll explicitly do this, and there will be small models that you tune just to do this weighting mechanism. You can have that level of expressiveness in Feast. This is a pro and a con: just treating everything you retrieve in the context as pure raw text tokens gets you started really well. I encourage people to do that. I think that's actually a really great start. What people find is that retrieval's the bottleneck.

Once you get the right context into the model, you’re great. The model can do it right. Extracting the relevant chunk, it turns out, is really hard. You can reduce this to a classification problem, if you think hard about it. Where you, again, partition your document, you have n chunks, and you want to find which one of these chunks, maybe more than one, do I want to include in my context? When you reduce this to a classification problem, then you can say, then I can get really expressive and have structured feature representations about this context. That’s when you can really start to optimize it.

Back in, maybe it was 2017 or something, or 2019, Google open sourced their recommendation engine that they used for YouTube. It was one giant neural network with encoding for a bunch of metadata. Then they did this vector similarity search as well on top of it. My point is that, at scale, when you really want to maximize or saturate the efficiency and performance of an ML system, you want to start doing something more advanced. I always tell people like, start with just MCP and raw tokens. Once you want to really extract that last 20% or 15% when it really adds value to your business, you have to start looking into something probably a little more sophisticated.

Participant 1: The idea is that for the simple use case, the Model Context Protocol where everything is token, it’s perfectly fine. When you start thinking, which token do I actually want to return? Which chunk is actually relevant? Maybe you want to take the vector encoding of those. You want to create a classification dataset. You want to train a new model based on all those factors and then use that to retrieve the chunk. Then a mixed system where you’re not just dealing with tokens is better suited for that, which Feast is.

Francisco Javier Arceo: Exactly. Again, our aim is to play nicely and be able to support both, and handle some of the other problems that I mentioned as well. The original 2020 paper in NeurIPS that introduced retrieval-augmented generation was written by Meta, those folks were awesome. It was actually all about fine-tuning. There was a query encoder and then a generator. They fine-tuned both of these things. The fine-tuning of the retriever was, again, optimizing the retrieval aspect of it. There’s a lot of utility, there’s a lot of gains that can be had from optimizing your retrieval step. The generator is a bigger model. That’s the giant LLMs that we’re all familiar with.

Training, optimizing, or fine-tuning the query encoder is actually pretty cheap. I do think that's an area more people are starting to dive into. I think there are a lot of people who were brought into AI who didn't have the same traditional AI/ML background as some of the people from the old days. I think some people are starting to learn these kinds of patterns, like, where are the gains here? Hybrid search, as an example, was the first thing people started doing: we can do keyword search plus vector search and do better, because vector search alone sometimes misses the obvious keywords.

Then the reverse is absolutely true, where keyword search misses a lot of the semantic, latent context. People are starting to pick up on these things, and they're starting to become easier. Within Feast, we hope to actually make this easier, so that lots of people can start using it, not just the old AI/ML crowd, the ML traditionalists.
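A small, self-contained sketch of the hybrid idea, blending a toy keyword score with cosine similarity over embeddings; the weighting and the made-up vectors are purely illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def keyword_score(query: str, doc: str) -> float:
    """Toy keyword score: fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5, top_k=3):
    """Blend keyword and vector scores; alpha weights the vector side."""
    scores = [
        alpha * cosine(query_vec, dv) + (1 - alpha) * keyword_score(query, d)
        for d, dv in zip(docs, doc_vecs)
    ]
    order = np.argsort(scores)[::-1][:top_k]
    return [(docs[i], scores[i]) for i in order]

# Tiny illustration with made-up 3-dimensional "embeddings".
docs = ["refund policy for late payments", "how to reset your password"]
doc_vecs = [np.array([0.9, 0.1, 0.0]), np.array([0.1, 0.9, 0.2])]
print(hybrid_rank("late payment refund", np.array([0.8, 0.2, 0.1]), docs, doc_vecs))
```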

Some other examples: risk modeling. I come from a long background working for a bunch of dinosaur banks, which I like, because I think finance is interesting. I spent a lot of time on risk engines, whether fraud or credit, building credit models, pricing, and decisioning systems. It turns out a lot of the challenges in getting those models into production, beyond the regulatory ones, are working with the decision rules, the engines, and the models in between. I had this joke that rules engines were the engineering of chaos.

Really, this is the bread and butter for Feast: basic entity retrieval, whether it's by user ID, or SSN, or whatever, Feast does really well here. Depending on which database provider you use, you can get extremely low latency, like p99s of 10 milliseconds, and it scales really well. We have a GitHub repository demo that shows a bunch of options for that. The example there is a credit risk demo. I think we deploy it on GCP; a lot of our examples are on GCP. It's a pretty straightforward application. It was originally inspired a lot by work that folks from Uber did once they moved to Tecton, drawing from Uber's risk systems as well. You'll see driver examples.

Another example is recommendation engines. It turns out, and this is a fun novelty, that once you peel back the onion of the implementation details, recommendation engines are quite similar to a lot of what happens in RAG, because once recommendation engines want to get sophisticated, they pre-compute a lot of features. Then they use vector similarity search to find the content recommendations. In the same way as with documents, there's an analog there.

Then, some people do something a little more advanced on top of that as well. You could also do the example I mentioned with Netflix, where you pre-compute your recommendations by user ID, and then just serve those at runtime. There's an example here where there's an offline store with some user features, product features, and other features, a recommendation engine that generates candidate recommendations, then runs a reranker, or does training, then uploads them to some database, and at runtime returns them. Then there's some person or client asking, what movie should I watch? Again, Netflix was the famous example of this, with their collaborative filter, the Netflix Prize, that was maybe 15 years ago now. It was a really cool paper. I think it was a sparse recommendation engine. It's a really fun math problem. Then we have a demo there as well.

What are some of the other benefits of Feast? There are a bunch of really great leaders that use it: Robinhood, Expedia, NVIDIA, Shopify, Capital One, Red Hat, obviously, Affirm. Twitter used to use it before they became X. I don’t know if they’re still using it. We have a lot of rich enterprise providers that are using Feast in production, and use it really successfully. Feast is a part of the Kubeflow ecosystem. I’m on the Kubeflow steering committee.

If you're familiar with the Kubeflow community, we work a lot on large-scale distributed training, and KServe for inference. We're making a lot of enhancements to the end-to-end process of training, serving, and running pipelines to do all those things with Kubeflow Pipelines; we're making that experience a whole lot better. That's always great, because you have a community of experts who are super excited and willing to help you and to make it more scalable. We have a very thriving community, lots of contributors, and a handful of maintainers who really love the project and are committed to its success. We have a Slack community that you're welcome to join.

Then, that user interface I showed you before. I wanted to show it again, where we give you an overview of the feature view that we talked about. This is the field, some metadata tags, and stuff, and then again, the lineage. We’re going to continue to enhance to make this better.

What's on the horizon? There's a lot more natural language processing work that we're continuing to do and invest in, because we think it's really important. Image support: being able to search image vectors, or images in general, as well as their metadata. There are a lot of really rich use cases for image tagging that end up being pretty trivial to build within Feast. Scaling batch with Spark and Ray: Adyen, one of the other adopters of Feast, donated the Spark offline store implementation for Feast, and we're going to continue to invest in that to help make scaling RAG easier. Again, you have a bunch of documents and a Spark cluster, and you want to be able to scale that pretty easily and then serve it with symmetry in production. We want to enable that use case; it's a pretty high priority.

Then, Ray Data, if any of you are familiar with Ray, the computing engine. Famously, I think OpenAI has talked about how they trained a lot of their models using Ray. We're writing a Ray Data offline store, so that people can do their distributed computing using Ray or Spark. Some people use them interchangeably, or run Ray on Spark; that is a thing. Latency improvements: we're continuing to make the Feast feature server faster. I think things can never be fast enough in the internets. Then, UI enhancements: like I said, we want to make the UI a really great experience for people.

Then again, a reminder: why should you care? I've seen a lot of teams go away because projects failed. I don't want people to have that same experience. I want everybody to be able to unlock the production machine learning flywheel. I think there's a lot of bad internet and a lot of bad product experiences, and I think AI can help make that better. What should you do next? Please try Feast. You can pip install it. You can see our architecture. You can see our community. We recently launched a Kubernetes operator, and the community's been using it and trying it with a bunch of different things. We even launched Milvus and Qdrant support, and there's some other exciting work on the way.

Participant 2: For the chunking part, does Feast suggest best practices, or does it have defaults for chunking to get a better outcome, like for context?

Francisco Javier Arceo: We don’t, actually. I also contribute a lot to Llama Stack as well, and they have an opinionated way of doing chunking. I think that’s good, because it gets you started very quickly. They do like a windowed chunking, and we don’t. We just give you the example that I gave there in the docs. We give you the toolkit, choose your own way to do chunking. I think the con of that is that if someone doesn’t know how to start, that makes it harder. We are planning on making that actually easier with some default stuff. Generally, we would say, choose what works best for you.
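For reference, a windowed chunker of the kind mentioned above is only a few lines; this is a generic sketch, not Llama Stack's or Feast's implementation, and the window and overlap sizes are arbitrary:

```python
def window_chunks(text: str, window: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size word windows with some overlap between chunks."""
    words = text.split()
    if not words:
        return []
    step = max(window - overlap, 1)
    return [" ".join(words[i:i + window]) for i in range(0, len(words), step)]

# Example: 200-word chunks overlapping by 40 words, so context isn't cut mid-thought.
sample = "word " * 500  # stand-in for a real document
chunks = window_chunks(sample)
print(len(chunks), len(chunks[0].split()))
```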

Participant 2: The memory generally for LLMs is hard to manage.

Francisco Javier Arceo: I think the question is, how do you manage the context size for an LLM? It depends on the model you use, because each has a maximum context window length. The big mega models have all gotten bigger context windows. I still think that ignores the pragmatic reality that in-context learning starts to drop off the larger the context length gets. I do think it's always best, when possible, to minimize context length to the degree feasible. We don't have any out-of-the-box tooling to support that.

 
