Transcript
Wes Reisz: In 1999, a young product engineer, young entrepreneur, I say young, he was about 28, had an idea that he wanted to bring something to market. The idea was that people would want to buy shoes online. While that seems a little obvious to us today, in 1999 it was a new idea, a new strange idea. What was even stranger is he had no product, no factories, he had no supply chain. What did he do to bring this to life? He went to a shoe store with a camera and he took pictures. He took pictures, he built a website, he figured out a way to do payments.
Then when people purchased the shoes online, he went back to the store, bought the shoes, and shipped them himself. From that came a product called Zappos. The person that I’m talking about is Nick Swinmurn, the founder of Zappos. Incidentally, Zappos was purchased 10 years later by Amazon, so you might say he had a little bit of success. Why am I talking about shoes? I’m going to talk presumably about code, about developing with AI. Because if you don’t build it right, if you don’t build the right thing, no matter how much technology you use, you’re going to go off the rails. That’s the main thing that I’m going to be talking about today. Lean thinking combined with AI today is a superpower. People, expertise, and domain knowledge have never been more important to be able to build into your systems.
Background
My name is Wes Reisz. I’m a technical principal with Equal Experts. I was at Thoughtworks before that. My focus is building sociotechnical adaptive systems, which basically means building systems at scale with people that change based on the needs of the system. That’s what I focus on and that’s what I’m doing. Today, a lot of that means people want to talk about AI. That means people want to talk about building with these tools. What I’m doing in this talk is going to talk about that through an experiment that we ran at QCon London. You heard Dio, you heard Pia talking about the experiments, these one-on-ones that they’re doing.
Every QCon, every InfoQ Dev Summit, has experiments embedded throughout the actual program. What we did at the last QCon, and I had the privilege to lead it, was to run an experiment to see if a certification program was something that people might want at a conference. A certification program really was never possible because at a QCon, there are 5 concurrent tracks running and 75 talks happening. You’d have to have somebody cover every single one of those, bring it all together, to be able to do some form of certification. With AI today, it gave us an option to be able to do that. This talk is about that experiment. This is going to be diving into the weeds. It’s going to be jumping in and actually looking at curl commands, building a RAG, looking at embeddings. This is the details of what we built and how we built it. That’s what this talk is about.
Outline
It’s going to start off with the birth of a new product. It’s going to go in and talk a bit about the creation of this product, what that looked like, and how we tested the idea before we ever introduced the first bit of AI. Then we’re going to go in and we’re going to talk about how we used AI to deliver. This is where we’re going to branch and go into a little bit of the details, on the feature branch here. We’re going to talk specifically about a RAG architecture, what that really means, and how it’s actually done. Then we’re going to talk about the video transcription pipeline.
To create something that I could use, with all of these talks available, to build this workshop, I had to create a pipeline that, after every single one of the talks, would transcribe it, chunk it, store it into a retrieval system, a vector database, and then retrieve it with a simple dense retriever, which we’ll talk about. Then we’d expose that in a way that I could use it and other people could use it. That’s what we’re going to go through. We’re also going to go through supervised coding agents. While I’ve been doing a lot of this stuff, I have not trusted supervised coding agents. I have been very much a naysayer about it.
In this particular project, I put that aside and I only used Claude Sonnet 3.7 via Cursor to generate it. Ninety-five percent of what you’ll see here was generated code. I’m going to tell you my observations of that, what I saw, what I did to try to constrain the boundaries so that it didn’t go off into wild hallucinations. I’m going to give you actual feedback on what I did. That’s what this talk is about. Then, lastly, I’m going to go into the workshop and a retrospective. I’ll show you what we did in the workshop. I’ll show you the output that was generated at QCon London. I’ll give you the retrospective, good and bad, of what I learned. Because this was an experiment and the idea was to get this in front of people that might use the system to test it, to figure out if this is a way forward that would be useful for QCon. This is the story.
This is the entire talk. Number one is, build the right thing. I’ve already started talking about Lean. Make sure you’re building the right thing. No amount of AI is going to help you if you’re building the wrong thing. Second, there are no silver bullets. Again, Michelle talked a bit about Frederick Brooks Jr., who said there are no silver bullets. I will talk about that. Embrace change is the name of the game today; with AI and the pace of innovation that’s actually happening, you want to embrace change.
Then last but not least, despite the amazing power of the AI tools at our fingertips, we still prefer individuals and interactions over processes and tools, just like we did in 2001 when the Agile Manifesto was written. These are the takeaways. This is what I’m going to talk about. One qualifier. I’m going to talk about RAG. I know RAG is dead if you look at anything on social media, but it is one of the foundational things that we want to really understand with AI. It is also incredibly powerful. I’m not going to talk about agents, MCP, or A2A. All those are wonderful, but this talk is primarily about this experiment. In that experiment, I used RAG.
The Birth of a Product – InfoQ Certified Software Architect in Emerging Technologies
First off, the birth of a product: the InfoQ Certified Software Architect in Emerging Technologies. I led the thing. You’d think I’d get the name right. This was a program, an experiment that we introduced at QCon in London. This was a cohort. We limited it to only 30 people because we wanted to test the experiment to see if we could truly use AI to deliver an impactful certification experience. Among the things we did, we had special events: a pre-conference breakfast where you could get to know what’s happening behind the scenes, meet the chair, those types of things. Then we did special panels, invite-only for this cohort. We brought them together to talk about AI in the SDLC. We did those things. Then we had some lunches.
All of that then was on top of the conference itself, which was 75 different sessions that people could go to. Then immediately after, we did a workshop that talked about the key trends that architects coming to QCon should learn about, should know about. Like what came out of this conference, what was the key message. This is where we used AI and this RAG architecture to pull everything together, content that I could actually talk to immediately the next day after the conference, which would not have been possible before AI unless I had five people all working together to create it. That was the InfoQ Certified Software Architect.
One little comment I’ll put out: Lean is a whole movement. We highly recommend Eric Ries’ book. He talks about this build, measure, learn loop where you test things, validate it, and repeat. Do it often; that helps you find product-market fit. This is one of the things, again, that is at the heart of some of the experimentation that you see at QCon. The very first thing that we did on this was talk to people to see if there was interest. There was. The next thing to do, it may seem strange, was actually to sell the product before we had the platform in place. What we did is we went out and on the 13th of January 2025, we opened up for ticket sales.
On the 15th, we sold our first ticket, and about a month later, we got to about 18. That orange line at 20 is the go/no-go line. If we didn’t hit 20, then this was going to be canceled. This is effectively seeing if somebody would buy the shoes, if there were people who were interested in it. What it also meant is that I had about that much time, until April, to make this a reality. Everything that I’m going to show you was done in four weeks, and it was done while I had a full-time job with clients, working mostly in the evenings with a couple hours here and there. All of this was done in four weeks. We sold out about six weeks before the event at 31, and we intentionally limited it there.
Again, this was an experiment intentionally designed to see if we could find product-market fit. Put a slope on there and we probably would have sold 50 at the current rate, but this was just put on the website, with very minimal actual marketing and push behind it. Notice I haven’t talked about AI yet. This is a talk about AI. It’s about validating things, making sure we understand what we’re doing before we go forward. The first lesson, before I even talk about anything with AI, is make sure you validate your idea, understand your product-market fit before you go forward.
AI in Delivery – Leveraging AI to Build a Modern Software Product
Let’s talk about what we’re actually here to talk about, and that’s AI in delivery. The majority of this talk is primarily going to be around what I did to build the solution, what I learned from it, and things that I’ll share if you’re going to go build a system similar to this. First off, let’s talk about the plan. What did I say? We had 75 different talks that were going to occur at the conference. I needed to grab all of that information, bring it together, and I wanted to put that into an LLM so that I could interrogate it and ask questions about what was happening. What were Matthew Clark’s key takeaways in his architecture talk on leading, scaling today, and shaping tomorrow? What were the key trends in platform engineering? What was Luca Mezzalira’s talk about AI? What were his key takeaways? I wanted to be able to interrogate it even though I wasn’t in all of those sessions, so that was number one.
In order to do that, I had to build a pipeline that could take every single one of those videos into some place where I could break it down and store it in a way that had semantic meaning, not just keyword matches, but the semantic meaning that’s inside it, so that I could find key takeaways across a series of different things. That was built. The architecture there on the right, we’ll break that down, but this is what was built. It was a monorepo, all serverless technology, all on AWS, built exactly how I would build any other system. Integration tests, unit tests. Everything was Terraform. Everything had end-to-end tests. Everything was the way that I would build it, minus some of the testing I would have done for a little bit larger scale.
By and large, this is how I would build any system. I also wanted to make this available to attendees, the people that were in the cohort, so that they could engage with it and ask it questions as well. What I wanted to do is be able to ask questions, retrieve the point where Wes said this, and be able to start a video at that particular point. There’s metadata attached to this in addition to the meaning from the content, so that we could start videos at the specific points where people talked about something that we discovered. I said this was four weeks that I did it all in. If it all fell apart, then we’d use Gemini and just throw everything into its large context window, because it had a context window of a million tokens. Didn’t have to do that, luckily, but that was the overall plan.
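To make that timestamp idea concrete, here is a minimal sketch of the kind of record each transcript chunk might carry; the field names are illustrative assumptions, not the project’s actual schema.

```python
# Hypothetical sketch of a per-chunk record: the text that gets embedded plus the
# metadata that lets a video start at the right point. Field names are assumptions.
from typing import TypedDict

class ChunkRecord(TypedDict):
    chunk_id: str         # unique id, e.g. "<video-id>-0042"
    video_id: str         # which of the 75 talks the chunk came from
    title: str            # talk title shown back in retrieval results
    track: str            # conference track the talk ran in
    day: str              # day of the week the talk ran on
    start_seconds: float  # where in the video the chunk begins, so playback can start here
    end_seconds: float    # where it ends
    text: str             # the transcript text that actually gets embedded
```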
RAG (Retrieval-Augmented Generation)
RAG, Retrieval-Augmented Generation, what is it? What I will start with is just a super basic description. When a user asks an LLM a question, it gives you a response. When it does that, it breaks things down into tokens, it transforms those tokens into vectors. Then the transformer layers process those embeddings based on what it’s been trained on. It generates probabilities for the next token, so it’s finding tokens to generate the next set of tokens. It’s not thinking.
Then it selects tokens for the response. That’s a little bit deeper than I even care to go at this point. What I want to just talk about is RAG itself. What RAG itself does is it injects a retriever ahead of the LLM. What it does at this point is it takes a question, goes against some structured or unstructured data to retrieve the data, brings it back, and literally just dumps it into the LLM so that the LLM has that in its context window. Why is that important? Because it reduces hallucinations. It gives you information that wasn’t available at the time the model was trained. If you go ask an LLM what its cutoff date is, it’s usually about a year ago. This was a talk that was given an hour before. In order to get that real-time information, I had to inject it into the model some way. RAG does a great job of making that type of information available. It also gives you domain-specific information that may not otherwise be available.
As I mentioned, it really does reduce hallucinations because you’ve given it a set of source data. One other interesting thing is it also gives you explainability of what’s coming out of your LLM, because it’s coming from this information. If it retrieves from a specific area, I can give that information, so it helps me explain where things are coming from in the model. All those are valid use cases for RAG that you can use in your systems today. This is called a dense retriever, and it basically will retrieve from something like a vector database. I want to stress here that I’m not doing keyword matching. When I said what are the key takeaways from Matthew Clark’s talk, I didn’t look at just the keyword matches that were there. I actually mapped and found what the actual meaning was for key takeaways and returned that.
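As a small illustration of semantic rather than keyword matching, the sketch below embeds differently worded questions and compares them with cosine similarity. The talk only says the OpenAI embedding service was used, so the specific model name here is an assumption.

```python
# Minimal sketch: semantically related questions score high on cosine similarity
# even when they share few keywords. The embedding model name is an assumption.
import numpy as np
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)  # a 1,536-dimensional vector of floats

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = embed("What were the key takeaways from the platform engineering talks?")
similar = embed("What actionable lessons did speakers share about internal developer platforms?")
unrelated = embed("What time does lunch start on day two?")

print(cosine(q, similar))    # expect a noticeably higher score...
print(cosine(q, unrelated))  # ...than for the unrelated question
```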
What does that look like? This is an example of writing it just as a curl command. This is actually calling the retriever. In this particular case, I’m asking, what is platform engineering? I gave it a key and my email, and this is the result set that returned. This is the dense retriever. If you look here, you’ll see different ones for what is platform engineering. Right there under title, there’s a small description. The next one’s a little bit bigger and then the next one is bigger. Again, it has information. The idea is to write this, pull back this information, and then provide that into the context window of the LLM, and then let the LLM do what it does extremely well. That’s the basics of a RAG. To make this really approachable, the way that we surfaced it is we used ChatGPT. Because ChatGPT has custom GPTs, we could configure it. Here it’s authorizing the user to the connector. It’s reaching out and calling it. That query, pushed back in, looks like this when it comes out of the LLM. This is a RAG. This is what was built. This is a naive RAG, but this is how the system was used.
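Put together, the naive RAG round trip looks roughly like the sketch below: call the retriever, then dump whatever comes back into the LLM’s context window and let it answer. The endpoint URL, parameter names, and response shape are assumptions standing in for what the curl demo showed.

```python
# Naive RAG round trip, sketched under assumptions: the retriever endpoint, its
# parameters (question, API key, email), and its response shape are illustrative
# stand-ins for the curl demo, not the project's real contract.
import requests
from openai import OpenAI

RETRIEVER_URL = "https://example.execute-api.eu-west-2.amazonaws.com/prod/retrieve"  # hypothetical

def retrieve(question: str, api_key: str, email: str) -> list[dict]:
    resp = requests.get(
        RETRIEVER_URL,
        params={"question": question, "email": email},
        headers={"x-api-key": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]  # assumed: list of {"title": ..., "text": ..., "score": ...}

def answer(question: str, api_key: str, email: str) -> str:
    chunks = retrieve(question, api_key, email)
    # Literally dump the retrieved text into the context window.
    context = "\n\n".join(f"[{c.get('title', 'unknown talk')}] {c['text']}" for c in chunks)
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model; the experiment surfaced this via a custom GPT instead
        messages=[
            {"role": "system", "content": "Answer only from the provided conference excerpts."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What is platform engineering?", api_key="...", email="me@example.com"))
```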
Let’s talk a little bit more about RAG. We typically only hear about the naive RAG, just what I built, but there’s a lot more depth to the architecture that’s available. What I just showed you was one low-quality retrieval result and several high-quality retrieval results. With that, I can do something like retrieve, re-rank them, drop out ones that don’t meet my threshold, and return just the ones that I think really provide a high level of value. That’s called retrieve and re-rank. A multimodal RAG can do things like video, sound, pictures. Anytime you’ve uploaded a picture into ChatGPT and asked it to analyze something, it’s multimodal. Those are also possible with RAG.
A super interesting one is some work we’re doing as a partner with Relational AI, building graphs where we have nodes and edges inside of a knowledge graph. We’re connected because we’re at InfoQ Developer Summit. If you can encapsulate that into a knowledge graph, then you can ask questions that have many different vectors of meaning, not just the embeddings that I used for the dense retriever. That’s a graph RAG. A hybrid RAG blends keyword text searches with the deep contextual RAG work that I did so that you get better results. Then finally there’s agentic RAG, where you build these retrievers and then you have agents decide which one to actually use to pull back the information that’s fed in. All these are extensions on top of RAG, and these are real ways to build knowledge into your systems very easily. By the end of this talk, I hope you’ll be able to go build these simply and easily if you haven’t already.
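To give a flavor of retrieve and re-rank, here is a minimal sketch: over-fetch from the dense retriever, drop anything under a similarity threshold, and re-order what is left before it ever reaches the LLM. A production re-ranker would often use a cross-encoder model; the score-based filter here is just to show the shape.

```python
# Minimal retrieve-and-re-rank sketch: filter out low-quality matches and re-order
# the rest before anything reaches the LLM's context window. Threshold is arbitrary.
from dataclasses import dataclass

@dataclass
class Match:
    title: str
    text: str
    score: float  # similarity score from the dense retriever, higher is better

def rerank(matches: list[Match], threshold: float = 0.75, keep: int = 5) -> list[Match]:
    """Drop matches below the threshold and return the best few, highest score first."""
    good = [m for m in matches if m.score >= threshold]
    return sorted(good, key=lambda m: m.score, reverse=True)[:keep]

# Example: one low-quality result among several high-quality ones, as in the demo.
raw = [
    Match("Kraken serverless architecture", "Key takeaways on serverless at Kraken...", 0.89),
    Match("An architecture talk that mentions serverless in passing", "...", 0.62),  # dropped
    Match("Platform engineering panel", "Golden paths and paved roads...", 0.81),
]
for m in rerank(raw):
    print(round(m.score, 2), m.title)
```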
What are the lessons here? A dense retriever converts queries into semantic meaning, again the meaning underneath it. When I say, what are the key takeaways? I’m looking for the actionable key takeaways that you can take back, not keyword matching on the words themselves. Second is chunking. It’s a critical step here, and I didn’t go into it at all in this section. Chunking is how you build those individual pieces that get returned to the RAG. What I did is I used wherever a speaker paused. When they paused, that was a thought, but that’s a very naive first implementation, as is the end of a sentence or the end of a paragraph.
This is a place where you can work really well with your ML folks to identify high-quality chunks that actually go in and get embedded, so that you can actually return those. Chunking is super critical. Before you get deeper into any of the RAG architectures, make sure you look into how you’re chunking things. Then, as I talked about, naive RAG is a way to get started, but there are a lot of really cool other architectures that you can look at. There is retrieve and re-rank. There is graph RAG. These are all very viable, strong approaches to get a really solid result.
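Coming back to chunking, here is a minimal sketch of the pause-based idea, assuming the usual Amazon Transcribe output layout with word-level start and end times; the one-second pause threshold is an arbitrary illustration, and a smarter chunker is exactly where you would work with your ML folks.

```python
# Pause-based chunking sketch. Assumes the standard Amazon Transcribe JSON layout:
# results.items is a list of word-level entries with start_time/end_time.
# The 1.0-second pause threshold is an arbitrary illustration.
import json

def chunk_on_pauses(transcribe_json: str, pause_seconds: float = 1.0) -> list[dict]:
    items = json.loads(transcribe_json)["results"]["items"]
    chunks, words = [], []
    chunk_start, last_end = None, None

    for item in items:
        if item["type"] != "pronunciation":  # punctuation items carry no timestamps
            if words:
                words[-1] += item["alternatives"][0]["content"]
            continue
        start, end = float(item["start_time"]), float(item["end_time"])
        # A long silence closes the current chunk and starts a new one.
        if last_end is not None and start - last_end >= pause_seconds and words:
            chunks.append({"text": " ".join(words), "start": chunk_start, "end": last_end})
            words, chunk_start = [], None
        if chunk_start is None:
            chunk_start = start
        words.append(item["alternatives"][0]["content"])
        last_end = end

    if words:
        chunks.append({"text": " ".join(words), "start": chunk_start, "end": last_end})
    return chunks
```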
Video Transcription Pipeline
Let’s go into the video transcription pipeline. I talked about the RAG, but how did I build the pipeline to actually deliver this? It’s a Step Function. These are technologies we’re all familiar with as software developers. What basically happened was an admin would start off by doing an upload into S3, and that would trigger our Step Functions. The Step Function would kick off the first service, the first module in the system, the transcription module. It would reach out to the transcription service within AWS and transcribe everything that was happening there. The next step is that it would go back to S3. S3 would then kick off another portion for chunking, and it would pull out those pieces that I mentioned, where the pauses were. All of those were loaded into an SQS queue so that we could process this in parallel. In parallel, during the conference, about 44 different instances would run for the embeddings.
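That first hop might look roughly like this sketch: an S3 upload event triggers a Lambda that starts the Step Functions execution for that video. The environment variable name and input shape are my own illustrations, not the project’s actual code.

```python
# Sketch of the S3-upload trigger: a Lambda that kicks off the Step Functions state
# machine for each uploaded video. Env var name and input shape are assumptions.
import json
import os
import urllib.parse

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        sfn.start_execution(
            stateMachineArn=os.environ["PIPELINE_STATE_MACHINE_ARN"],
            input=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"started": len(event["Records"])}
```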
Embeddings are where we took those chunks and created vectors of 1,536 floats, basically numbers that represent the meaning inside a space, but with 1,536 different dimensions, which allowed us to zero in on what that actual content meant, and we could compare it to other things. That was the embedding. That used the OpenAI service for embedding. Then once that was done, it stored it into Pinecone, which is a vector database used for retrieval. What did that actually look like in practice? This is the system running. Here is an S3 copy command that’s going in. You can see the metadata attached here. You can see in this case it was an mp3 file that was uploaded. You see the track. You see the day of the week that it actually ran on. As it processes, it uploads it.
Then we’ll go over to S3 and take a look to see that it was actually uploaded. Then here you see it inside of S3. We go into our media folder, and you see that it was there. Then from here, we’re going to go look at our Step Functions. In our Step Functions, you’ll see the first process kicking off, where it actually did the transcription. Up here is our transcription running to create the transcript. That takes about a minute to go. Rather than watch paint dry, I’m going to fast forward a little bit. You go and see the next few steps. It goes into chunking. It goes into embedding, storage. All those on the right are dead letter queues, so if anything failed, it would write to the dead letter queue. This basically was the orchestration that took everything that was being processed and put it out into a system that we could run.
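The embedding step, as described, would look roughly like the sketch below: an SQS-triggered worker embeds each chunk with OpenAI and upserts it into Pinecone along with its metadata. The index name, message shape, and embedding model are assumptions.

```python
# Sketch of the embedding worker: SQS-triggered, embeds a chunk with OpenAI and
# upserts it into Pinecone with its metadata. Index name, queue message shape, and
# embedding model are illustrative assumptions.
import json
import os

from openai import OpenAI
from pinecone import Pinecone  # pip install pinecone

openai_client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("qcon-talks")  # hypothetical index name

def handler(event, context):
    vectors = []
    for record in event["Records"]:            # one SQS message per transcript chunk
        chunk = json.loads(record["body"])     # assumed shape: the ChunkRecord sketched earlier
        embedding = openai_client.embeddings.create(
            model="text-embedding-3-small",    # 1,536-dimensional embeddings
            input=chunk["text"],
        ).data[0].embedding
        vectors.append({
            "id": chunk["chunk_id"],
            "values": embedding,
            "metadata": {
                "video_id": chunk["video_id"],
                "title": chunk["title"],
                "track": chunk["track"],
                "start_seconds": chunk["start_seconds"],
                "text": chunk["text"],
            },
        })
    index.upsert(vectors=vectors)
    return {"upserted": len(vectors)}
```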
Some basic numbers. Again, prototype, not huge scale. Seventy-five videos were processed over the three days. On average, they took about 3 minutes 45 seconds to process. About 15,000 embeddings were created from the videos that were actually there. It cost about $130 over those days, and the prices are there. This ran very well with the testing that was done on it. There were no errors. Everything ran really smoothly for the actual event. What did we learn here?
The big takeaway that I wanted to express here was that nothing I showed was anything we don’t already know. It was serverless technology. I used a monorepo on GitHub. I structured everything. I used Python and serverless to write these things. I used CI/CD tools. I used Terraform. I used GitHub Actions. All this was the stuff that we know and use, but I built a product for RAG with AI inside. While RAG sounds different to many of us, the tools and technologies that I used to implement it are what we’re already familiar with. There were some new things, OpenAI, which was an API, and Pinecone, again an API, that were actually used in delivering it.
Supervised Coding Agents
The other thing that I wanted to talk about is that I made a conscious decision here that I wasn’t going to write code. What I intentionally did here was use Cursor, and I used Claude Sonnet 3.7 for this one, not 4, to generate almost all of the code. I’ll tell you about the cases where I didn’t throughout the process. I didn’t just vibe away at this. I was very intentional about how this was done. This is the process that I actually followed. The first part was a lot like a product person or an architect iterating on an idea and shaping it with a low-cost LLM. Over about an hour or two, I just iterated over and over again on what I wanted to do, what the architecture should look like. I had a good whiteboard session about what I wanted with some other folks, but I used this to iterate and define, used it almost as my muse to validate my thinking.
From that point, I developed requirements of what I actually wanted to do from that shaping. From those requirements, I developed an old-school spec, back to waterfall almost. It gave me some bad flashbacks to back in the day when I had to produce these things. However, I did it over a minute, not over the three months that I was beat over the head with back in the day, but it produced a spec that defined what I was actually going to do. From that spec, I told the model, I want to use another model to generate this code. Help me break this down into small individual pieces that demonstrate single responsibility. What this did is it broke things down into a prompt plan for me to approach how I was going to go about this. One of the problems that I’ve seen is if you give it too large a problem space, it gives it too much rein. The prompt plan direction keeps it small and focused so that you can achieve better results as you’re going through leveraging something like chat-oriented programming.
From that point, I knew what I wanted to build, but I needed to go back into my architect or dev lead, tech lead role, and set up some guardrails. I used Cursor Rules files, so I set up the architecture with Python rules. I set up the testing that I wanted to use to make sure I was going to go through it. I established that everything would be infrastructure as code and use Terraform, and I embedded all of these into my prompts, so these were constraining what was actually being built. I leaned towards one-shot prompts, where I could give it what I was trying to do, give it an example, and then work from the code from there.
Once I had my guardrails and went back into my tech lead persona, I bootstrapped the project. Again, I used AI, but I could have used anything to do this. Then from there, I iterated on those prompt plans and developed each of the individual features that I wanted in this particular system. I do want to call out one of my colleagues, Satya, here. He introduced me to this process, this way of thinking about doing it, with one of the companies, Travelopia, that we were working with on some of this stuff.
What did this actually look like? Let’s go through and look at an example. This is the Git repo. You can see the good, bad, and ugly of the code that’s out there that was generated. Each one of these is a module that was created. I’m going to go in and look at one of the features that was developed. This was the service that retrieved data and did the upsert into Pinecone. Here you see my technical requirements for what I built. In this case, at this point, I used environment variables that were later replaced with secrets. At the bottom were notes on using free tiers, things that I wanted to use initially. There were references there scrolling past; those were my Cursor Rules, the guardrails that I put up. This is what I used as the prompt that I gave to the LLM. Now let’s look at the source code that was generated. This is the Pinecone service that was actually created.
As you look at it, it’s readable. It’s not clever. I can read through this. It is understandable. It’s not overly abstracted. I probably would be a little more clever in how I tried to write it, but it did what I needed it to. There at the bottom, you can see the dictionary right here with the data types that I specified. I didn’t just vibe away. I was very specific about what I wanted to do here, and I gave it that information so that it would produce what I wanted to build. In fairness, I didn’t always do that. In some cases, I didn’t have to. What I did is I captured any prompt that I used. It’s all in that Git repo. It gave a feel for what was working and what wasn’t working. I usually started from the prompt that the LLM gave me when I divided my prompts up, but there were some cases it didn’t cover everything, so I needed to create new ones. In this particular case, for whatever reason, the SQS queue was missed, so I wrote this one. I didn’t have to iterate back on the prompt. I just used this and went forward.
There is a danger here with what I call doom loops, because when you do that, you can get into this case where you lose sight of the original prompt, and you keep going into this cycle and cycle and cycle. If you ask the LLM to give you an answer, it’s going to give you an answer, whether it needs to or not. What’s crazy, these doom loops, you can see a problem that you’re trying to solve, and it will give you an answer, give you an answer, and all of a sudden you’re back to the original problem that it was solving in the first place. Try not to do that. Try to go back. If you’re using these to actually create code, go back to the original prompt and always edit from that point as best you can. That’s generally what I tried to do throughout the process.
Most of what I did, I was working on individually, so I just used a single deploy.sh script to deploy everything. It ran my integration tests. It did everything that you see on the screen. It was just easier for me not to have to drop GitHub in the middle, doing a push every time I wanted to deploy something. About a month ago, when I was getting ready for this talk, I wanted to have a good visual, so I decided at that point I was going to go ahead and put my GitHub Actions in.
I actually did this after the conference, and I was glad I did because it gave me some other metrics that I hadn’t captured before. This is what actually was built. I did it via a Python script. This is actually GitHub Actions, but you can see each of the modules being built. You can see the Lambda packages being built. It uses Lambda layers because it has all of its dependencies in it, all of the infrastructure that was built from it, the end-to-end tests that were run, and then the deployment summary. The deployment summary is above there. You can see each of the services that were deployed and all of the infrastructure that was built to support this. This was built as I would build any project. Again, I intentionally used chat-oriented programming throughout the process just to get a feel for whether this is something that I would trust if somebody was truly paying me to do it.
These are some of the numbers that I went through. Again, Cursor, Claude Sonnet 3.7. Some basic stats over the month when I generated this. There were 977 calls. You can argue there were probably more than that. I’m a Java developer, not a Python developer. I intentionally used a language that wasn’t my first language to see if that was going to be really viable for me. I can write Python, but idiomatic Python was not really what I considered to be my core skill set. I knew enough to say if something was wrong. I wanted to use something I wasn’t absolutely familiar with when I went through it. I probably made more calls than you might if you’re super familiar with the language being created. It wound up costing, though, $39. That’s what it cost for Cursor at this particular price.
One thing that jumps out at me: it was 20,000 lines of code that got generated. I can’t imagine that I would have written that many lines of code for this. I haven’t specifically written it myself. I would suspect that’s super inflated compared to the number that was actually needed. I mentioned GitHub Actions wasn’t part of the original project. I just used my deploy.sh. Same process, but I didn’t use GitHub Actions. When I went back and added it, I noticed a new thing that was available in Cursor, and that is it gives you the number of lines the agent edited, what was accepted, and what was rejected. I didn’t have that option or that view when I went back and did my basic stats. I went ahead and captured it here just to give you some examples of what it looked like.
Creating my GitHub Actions, there were about 796 lines that were created, and 420 of them were in my deploy.yaml. Again, I feel like that’s large. I probably could have done it with a little bit less than that. The README file was about 376. The README files, I was super impressed with the quality of what could be created from my code with this. Very impressed on that front. Then, in this space, there were 3,000 lines that were recommended to me. I accepted just shy of 2,000. Those were sometimes edits to previous ones, which is why you see that there. That’s what it looked like, actually using this to generate the GitHub Actions.
What lessons did I get from doing this? This was one of the things that I wanted to get out of it. First off, about 95% of what you saw there was written by the LLM. What’s with the other 5%, Wes? Why’d you do that? To be technical, it was when I lost my shit with the LLM because it was creating stuff and wouldn’t do exactly what I asked. I’m not proud of it, but I might have had a little breakdown with Claude, telling it just to do what I asked and stop going down loops. Then finally, smugly, I wrote it myself and said, do this. Had a little moment, but I got past it. About 5% of the code came from that. I think that’s interesting. These tools are not going to do it all, but they can remove undifferentiated heavy lifting on wiring things together. There are times when you need to jump in and know what you’re doing and tell it what you want. I think this demonstrated that.
Apply a structured approach where you remain in control as you’re going through this so that you can control the outputs that are created. Well-structured prompts are crucial. Remember what I’ve said about how LLMs work, it’s building up this context. You’re putting things into the context so that it can predict, not think, mathematically predict the next set of tokens that are coming out. What you put into that context matters. Having well-structured prompts helps you with putting quality into the engine that’s producing this. The code you saw works. It’s reliable. It scaled to the level that I needed it to, but it was still a prototype. This was an experiment. I’m not sure I’m at the point where I would completely trust all of this. I talked about that code bloat that was created there.
If you read the last DORA report that came out, there was a lot of conversation about batch size. My batch sizes were larger than I liked. What other ramifications would that have downstream? While I experimented with this and it was super impressive, I still remain skeptical on whether I’m fully going to use this to replace my own abilities in different places.
One of the major things that I saw that I really disliked was that code reuse was incredibly poor for me on 3.7. I had those different modules, and when I tried to do things like accessing S3, I would have one module that was working, and then it would rewrite something else and break integration tests, to the point that I just said stop, and I kept things very independent, which means I’m going to run into issues down the road when I change one thing and have other places now that aren’t changed. Code reuse was pretty poor. Perhaps that was the way that I implemented it. Perhaps it was the speed, but it was an observation that I got from doing this.
Some would argue here, and someone actually argued to me, who cares? Just regenerate the code. You have that module, just regenerate it. If that’s your problem, so what, regenerate it again. I think that is a poor answer. Think about what will happen to MTTR if I have an issue in one area, two areas, three areas. I fix it, and it keeps running all over the place, and its behavior is non-deterministic and changes each time. I don’t think just regenerating the code for a small little thing is the right answer here. Beware of doom loops. I mentioned doom loops. When you get away from those initial one-shot prompts, the one-shot prompt is where you give it a sample and tell it to use that as your example.
Doom loops are when you start to get away from that, and you start asking it to fix something, fix something, do something, do something, and all of a sudden you wind up right back where you started in the first place. I’m like, what just happened here? If you focus on the prompt itself, it can help remove some of those doom loops. I mentioned batch size a little earlier, but batch sizes were larger than I would have liked. I think there are downstream ramifications when our code batch sizes are larger.
One interesting thing, though, about the negative things that I said here: this is as bad as it’s ever going to be. Think about that for a minute. It’s as bad as it’s ever going to be, which means it’s only getting better. If you’re not already using these tools, you need to, because this is a major shift and change in software. Even though I remain a bit of a pessimist on how I bring all these things into my day-to-day work, every day I’m using it, every day I’m reaching forward with what these tools are. Make sure that you are, too. The other thing I’ll bring in: I did not use headless agents here, that is, autonomous background coding agents, in this process.
I had a very defined use case and I went with it. Had I done it today, I probably would have had some things running in the background: looking at complexity that was there, looking at code reuse that I could have done, things like that. One of my colleagues from when I was at Thoughtworks, Birgitta Böckeler, recently wrote an article on Martin Fowler’s blog that talks about some of the uses of autonomous background agents. I recommend checking that out.
ChatGPT
I built the pipeline. I built the RAG. I showed you how I planned to use it. How did we surface it? The easiest thing for me at this point was really just to surface it to people. Everyone knows how to use ChatGPT, so I decided to use custom GPTs within ChatGPT. Let me show you how that worked. A user would go in and ask a question. ChatGPT would have some instructions that basically said, for this type of question, call this retriever. The retriever would do the dense retrieval and call out. It would get an embedding for the question that was actually asked, and then it would look that up in Pinecone, that vector database. Then, as I said, the results would be returned and would be put back into the LLM. Just to come back to this example that I showed before, or a similar example, this is that simple retriever, that curl command that was there.
In this case, it says, what are the key takeaways for Kraken serverless architecture? You can look at the results that were there before. I’m going to show you that inside of ChatGPT. While it’s on the screen, I wanted to call out something else here. If you notice, this particular talk is one of the results for key takeaways about Kraken serverless architecture. If you look here at the bottom, Kevin Bowman was the speaker for Kraken serverless architecture.
Matthew Clark, who I mentioned before, was not. If you know anything about the BBC’s architecture today, you’ll know they’re a heavy user of serverless technology. In this particular question that I asked the LLM for context to match, I said, what are the key takeaways from Kraken’s serverless architecture? It matched the Kraken serverless architecture talk, but it also matched Matthew Clark’s talk because it referenced serverless. This is where you can use some of those other approaches, like retrieve and re-rank, to grab these things and throw out what didn’t matter, or use other approaches within the RAG to retrieve better quality information. When I was talking about chunk size, what I was actually talking about is this text right here. You want this to have as much meaning as possible. The quality of what’s actually in these particular chunks is what’s going to be fed into the LLM. That needs focus. This, again, was just where speakers paused.
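The retriever the custom GPT calls out to, the one sitting behind the API gateway, would look roughly like this sketch: embed the question, query Pinecone for the nearest chunks, and return titles, snippets, scores, and timestamps so a video can be started at the right point. Again, the names and shapes here are assumptions.

```python
# Sketch of the retriever behind API Gateway that the custom GPT action calls:
# embed the question, query Pinecone, and return matches with their metadata.
# Index name, model, and response shape are illustrative assumptions.
import json
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("qcon-talks")

def handler(event, context):
    question = event["queryStringParameters"]["question"]
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    result = index.query(vector=embedding, top_k=5, include_metadata=True)
    matches = [
        {
            "title": m.metadata.get("title", ""),
            "text": m.metadata.get("text", ""),
            "score": m.score,
            "video_id": m.metadata.get("video_id"),
            "start_seconds": m.metadata.get("start_seconds"),
        }
        for m in result.matches
    ]
    return {"statusCode": 200, "body": json.dumps({"results": matches})}
```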
Back to the ChatGPT story. How did we implement that? If you’re on a Plus account, it’s pretty simple to go forward and add one. You go down to GPTs, and you can see adding one here. This is what it would look like, so that you could actually go, type a question, and it would use that content. How was that actually created? If you go into the edit side of it, it’s pretty straightforward. Name, description, you give it some instructions. In the instructions, I tell it when to use certain pieces of information, when to use the schedule, when to use the retriever, what I wanted it to look like, what to check. All those go into the instructions, which is part of that context that gets used in what’s generated.
If you look a little bit lower here, I also provided additional knowledge into this GPT. I provided the schedule and voting results, so you could actually ask for the top-rated talk, which ones were on platform engineering, things like that. Actually implementing the retriever was also pretty straightforward. For the actions, this is the OpenAPI spec, and this is the API key or OAuth. In retrospect, I wish I had used OAuth, but I used an API key for this particular one. Then, if you go through here, just the traditional thing, you can see the API gateway endpoint that was surfaced. Then there are some things down here that could be tested.
Then, of course, there’s a privacy thing. This is how I exposed it. In retrospect, I wish I hadn’t. There are some things about custom GPTs. The model that was used there was GPT-4 Turbo, which sits behind a paid ChatGPT license. Because of that, when I ran the workshop, I had some people who didn’t have a ChatGPT license, so they couldn’t use it. That created a negative impact on the overall workshop. I wish I had done something like build my own frontend and put LangChain or LlamaIndex in there, which are also very viable approaches. This is the point in the presentation where someone says, why didn’t you use MCP? MCP was created in November of last year. I did this in February and March. It was still super early and new.
In retrospect, I probably would have. MCP allows you to connect internal systems and provide that into a model. What’s really powerful about it is its multi-step ability, which allows you to inject at different points. In some of the prompts that I created, I said, what were all the architecture talks? What were the key takeaways? What were the messages that went across them? What were the action plans? Those steps are multi-step, agentic, and MCP is a really good solution for doing that. RAG is a really good solution for asking a question and getting a response. MCP likely would have been a better one at the time, but that’s in hindsight, in retrospect. That’s why we do experiments, to test an idea, understand what the real boundaries are, so we know what to actually build. I built this, like I said, in four weeks, so it served that purpose and served it well.
Workshop and Retrospective: Lessons Learned
How did it all come out? How did it work out? How did the workshop come out? Let’s talk a little bit about the retrospective. During the actual program, you see the tracks across the bottom, and the people who were in the cohort could go to whatever track they wanted to. To kick off the conference, we did an early breakfast. Did some behind-the-scenes, get-to-know-the-chair type things for the conference. Then on day 2 we did what was supposed to be an invite-only cohort session (it wasn’t quite invite-only) on AI in the SDLC, in small groups, so that the cohort could interact with the speakers directly. We did focus groups at lunch, with me and some speakers talking about different tools and technology.
Then immediately after the event, we did what we couldn’t do before: we ran a workshop based on the content from the things that had just happened. This was the key thing that AI allowed us to do. The use of coding agents was something I wanted to test out, but that wasn’t one of the core experiments, the core hypothesis of what we were trying to do. The workshop looked like this. There was an hour lecture or talk, an interactive discussion, where I went through the key themes of the conference. We looked at things like architecture. We looked at people, practice, and policy. We looked at emerging trends. We looked at four or five different areas for the software conference. We talked about what they were across the conference, and then came up with an action plan as to what to think about as you went back to your shops come Monday morning.
From that, we did this thing called an open space. An open space is one of the things that are really popular at a QCon. It’s what they’re known for. It’s where you bring your ideas, your problems, and you use the peer group to address those, talk about those, or dive a little bit deeper into them. We used those to pick the topics that people wanted to focus on. Then we broke down into open spaces, which were four groups around different topics, to focus on these trends that we talked about. Then we did key observations and wrap-ups.
What did it look like? This is one of the ones on architecture trends that came out of QCon London. The executive summary, this might resonate with some of you: architecture is no longer purely a technical concern. I’d argue it never was. Regardless, architecture is no longer a purely technical concern. It is deeply sociotechnical, systematic, and strategic. From that, if you look at one of the five trends that we focused on, architecture as an organizational ability: architecture is shifting from central design to a distributed, participatory practice. It referenced three of the talks in talking about this. Architecture must consider the social, cultural, and political forces that are shaping technical decisions. It talked about how platform engineers are role models for scaling architectural patterns across systems that align with human systems.
Then mentorship can be used as a design scaling function. Suhail Patel talked about how folks in the architecture practice can be used to help scale what’s actually out there. The implication is that you should be embedding architecture ownership, autonomy, purpose, and mastery across the org. Create feedback channels, not just diagrams. You should be leading by influence. That was one of the five things that we talked about. Here are the other four that we talked about with architecture. We did this for each of the major buckets at the conference. At the bottom here on the left, we came up with action plans on how to focus this across the org, across the platform, how to leverage AI design with this, governance and culture with this. We referenced the sources for each one of these. This is what the RAG gave us the ability to do.
This is something I would not have been able to create without putting five people in each individual track who would come back and collaborate on what it was. AI gave us this ability that didn’t exist before. The other part was I wanted this to be in the hands of the cohort members so they could ask questions during the workshop. This is the workshop. You can see one of the things down at the bottom there about the open space; the things on the left are the dot voting on which ones they wanted to do. This was the groups meeting here. This is the readout afterward. Then the tool was in their hands so that they could retrieve and ask questions, get key takeaways about the actual content that was delivered.
What were the lessons? Overall, it was a qualified success, but there were clear signals for iteration, things that we needed to do. There was high participation, no red votes. It was rated about 89% green with about 56% super green. There were good ratings across that. On the survey afterward, we got a lot of negative feedback where things weren’t quite 9 and 10. They were a little bit lower on a 10-point scale. That had to do with efforts around organization and communication, things that I probably failed to index on because I was playing with the tech. I was too focused on trying to create the RAG and not on the experience of the people that were there. This was clearly a failing on my part. What worked? The audience that came to this was senior. Two-thirds of them were in decision-making and strategy roles. We don’t typically see that in workshops at QCon.
The workshops, they tend to be a little bit more junior. This was an entirely different segment that we didn’t anticipate when we did this experiment. That’s one of the reasons why you experiment. The content and instructor ratings were high. Again, I probably had a lot of geek cred because I built this. The value of the content was really well received. The peer networking was really well done and really well received, because these people stayed together throughout and made these decisions. There was a lot of communication that we saw afterwards with people really engaging.
One of the big takeaways I got here, which became one of my key takeaways for this conference, is that despite giving them this tool that had all of the conference in their hand, they preferred talking to each other. We prefer individuals and interactions over processes and tools. Even in a world of AI, this remains true. Don’t forget it as you build your systems. Lean thinking remains as important as ever to find product-market fit. Then, the rumors of RAG’s demise are greatly exaggerated. RAGs are great. Yes, agents are hot. MCP is great. A2A is great. All these are wonderful and deserve the attention that’s there. Don’t ignore what you can do with a RAG. It’s a very simple process to actually build one. On supervised coding agents, they’re incredible. However, I had to have the experience of what I was building to shape that. I would not have gotten the system I did if I hadn’t been building cloud-native systems for a while. It would not have looked like what came out. Experience matters.
One thing that I’ll talk about that I really hadn’t intended is the second to last one. When I was done with this, I was burned out after those four weeks. Completely burned out. I needed some space to just get my head together. What’s interesting is everybody I’ve talked to who’s introduced programs like these has said the same thing. As you introduce these into your systems, beware of the human toll it actually takes on people. It feels like these tools can get you 90% of the way there. Andrew Clay Shafer, one of the founders of the DevOps movement, said these tools will get you 90% of the way, and I completely agree; all that’s left is to get the last 90% done. This is a product that’s now being created at QCon. It’s been scaled out. They’ll be running more of it. It’s still an iteration, but overall, the experiment was a success.
My final key takeaways, as I said before: build the right thing. Practices like Lean will help you do that. There are no silver bullets. Just as when Frederick Brooks Jr. wrote it then, it hasn’t changed today. There are no silver bullets. Coding agents are amazing and super helpful, but you have to be able to provide guidance and expertise to get the output that you want. Embrace change, that’s from Kent Beck’s book from ’99. AI tools, again, are incredible, but they’re happening at such a fast pace you can’t know them all. If you don’t get started, you’re multiplying by zero. Get started and embrace the fact that change is out there.
Then, last but not least, individuals and interactions over processes and tools. These things are incredible, but remember and don’t forget the people and the process; they’re important. I started off talking about Zappos, but I didn’t mention the number. Ten years after that first experiment, Zappos was sold to Amazon for $1.2 billion. That was from a guy going to a shoe store, taking pictures of someone else’s shoes, and building a website from it. That was the foundation. Use Lean experimentation, embrace AI, and you can do incredible things. Lean thinking coupled with AI is a superpower, but don’t forget people; expertise and domain knowledge have never been more important. These are the references for my project. That’s the GitHub repo if you want to take a look at it.
