Transcript
Sam Cox: Let’s set the scene. It’s 3 a.m. and you get paged like this: all your systems are broken. In that moment, it feels make-or-break, not just for your project, but for your whole company. How did we get here? What are we going to do about this? Why is it all going to be ok? I’m Sam. I spent about 12 years as a professional software engineer. I’ve built a number of interesting systems and interesting architectures during that time. For the last two years, I’ve been CTO and co-founder of Tracebit, which is an early-stage cybersecurity company based out of London. If I tell you that two years ago, I hadn’t worked with C# at all, and today I’m going to be telling you about C#, you will recognize that I’m not an expert. What I can tell you is the real brass-tacks experience of building a startup in an unfamiliar language, and in particular in C#.
Building Your Own Startup
Starting your own company, a startup in particular, is quite an audacious thing to do, because you want to compete with large companies, but you have very few resources. From day one, you’re quickly running out of time. When I found myself in this position, it really focused the mind. What you really need to optimize for is productivity, which I’ll define here as spending time on what matters. Throughout my experience as an engineer, productivity has been a key part of what I’ve looked for in the tools I’ve chosen. I think spending time on what matters is the core of a good developer experience. This goes further as a startup. I might write a blog, or I might talk to someone about my ideas, or I might go and attain some security qualification like SOC 2 Type 2, because maybe that’s the most important thing for my business at the time, rather than writing code.
The trouble is, it’s not always clear what exactly matters. When you start building a company from scratch, what I would recommend is to get started quickly. Having got started, you want to get feedback on what you’ve built. Then you want to iterate. This is the core of Agile development. Hopefully, if you go through this iteration cycle enough, you will be successful in the long term. For an early-stage startup, being successful in the long term means finding product-market fit: you’ve built the right product for a market that wants it. It’s going to look something like this. This is an optimization problem, and each step you take will be in the right direction or the wrong one.
As long as you act on feedback, and you make those feedback loops quick enough, hopefully you will converge on product-market fit. Of course, you will be operating under constraints. Some of them are just inherent: you have very few resources, and your runway, cash in the bank, is running out day by day. You will also be constrained by the choices you make, your technology choices. You want to make technology choices which allow the most scope, which give you the most leeway to find product-market fit without constraining you more than is necessary.
Timeline – Ideation
All startups start with an idea, and this was ours. Let’s say you have an AWS account, and you have some resources in it. These are S3 buckets in jupiter-corp: I’ve got jupiter-corp-config and static-assets. You’re worried about protecting the data in that environment. Of course, you don’t just have S3 buckets, you also have applications accessing those buckets. I’ve got a frontend, and I’ve got an inference worker. That sounds relevant today. You have applications accessing these buckets, which contain critical and sensitive data. Our idea was this. We were going to create what we call a canary bucket. You might know this concept as a honeypot.
Essentially, it’s a bucket which looks just like any other within the account. This is jupiter-corp-db-backups. It’s not used for anything, it just sits there doing nothing. All of these access patterns, from applications to underlying datastores, can be monitored in AWS. The source of telemetry here would be CloudTrail audit logs. Let me ask you a question. What would it mean if jupiter-frontend suddenly started accessing this canary bucket? What would cause that to happen? It’s not going to be some quantum bit flip that’s caused this to happen. Probably your engineering team haven’t decided that it looks like a good bucket to start writing or reading data from.
On the balance of probabilities, the most likely thing here is that jupiter-frontend has been compromised, whether that’s an application security or supply chain risk. The attacker is using its underlying credentials to try and discover data within your account to exfiltrate. That is clearly something you or your security team would want to know about as soon as possible. If that happens, we could ping you a message on Slack. That’s the idea. That’s what we wanted to build.
What was I going to use to build this product? If we think back to getting started quickly, getting feedback quickly, and iterating quickly, for me, what made most sense was to build this on AWS, because I had years of experience with AWS, and I could get started quickly. I knew I could iterate quickly in AWS. Similarly for Postgres, I had years of experience of Postgres. I also knew that I could iterate quickly within an RDBMS like Postgres, because I could use JSON columns where I didn’t need to specify my types exactly, where it didn’t matter yet. I could add indexes as required, and I could change my query patterns as required.
When it came to the programming language, I didn’t just stick to what I knew. I’d had about six years of Python experience, and I’d had about six years of TypeScript experience. I’d enjoyed those stacks. I’d built interesting systems in them. I was looking for something a little bit different in this case. Probably the key things I was looking for were a statically typed language with a batteries included experience, where I could have easy tools to build on with a very strong supply chain security story. Because I was going to build this system which would integrate with our customers’ AWS accounts and access their audit logs and see which resources were in there, so I wanted a strong supply chain security story, along with various other criteria.
If I went to Google and I asked, what stack should I build a startup in? Its AI summary would helpfully tell me something along these lines. If you think about what you would choose to build your own startup in, there’s a very good chance your solution would be on this list. I don’t think there’s a right or wrong answer, but it’s quite notable that C# is not on this list. Whereas if I go to Stack Overflow Developer Survey for professional developers, and I filter the ones which you could conceivably consider a backend language, you’ll find C# at fifth on the list. I don’t think popularity is the be all and end all, but it is an important consideration when it comes to examples, documentation, battle-tested libraries, ecosystem, hiring. It caught my eye, and I looked into it more.
Am I really saying anything special here? I went and chose the fifth most popular language. Loads of people use it. Why is that noteworthy at all? When I blogged about this earlier this year, there was a Hacker News comment which accused me of being solidly mainstream while cosplaying as a contrarian, which my co-founder enjoyed so much that this mug is sitting on my desk now. This really is contrarian, and I’ll give you some evidence. This is workatastartup.com, I think it’s called, and this is a jobs board for all Y Combinator portfolio companies. If you filter it for backend engineering positions, you get about 240 results. Their search isn’t so good for C#, so I went through them all, and there are only three which mention C#.
Two of them are with a mishmash of other languages, and one of them is C# full stack, although it sounds somewhat apologetic or undecided about it. When it comes to the context of startups, this really is unusual, and that’s why I thought it was worth talking about today. I think it poses an interesting question, which is, what would make a language well-suited to large companies but not startups when all startups aspire to become large companies? I think a lot of what this has to do with is preconceptions that I shared before I started looking at this.
My preconceptions about C#, before I’d looked at it at all, were that it was expensive. That it was verbose and therefore slow to get started. That it was Windows-centric. I’d never deployed production systems in Windows. I didn’t know what that world looked like. I’d only ever used Linux. That it would somehow be conservative and legacy and slow to iterate, when iterating quickly was a key criterion for succeeding as a startup. In a word, I think maybe C# is popular but not trendy. When I looked at it, the reality was that I found C# to be free, open-source, cross-platform, heavily optimized for Linux environments. When I started reading some example code, I found it expressive, familiar, and modern. The details of this code don’t really matter too much.
This code exposes an HTTP endpoint, which takes some sales data. It groups it according to some notion of what time of day it is, and then it computes a load of aggregates on that, like the sum and the max and the most popular product. When I read code like this, it felt familiar and approachable to me because it has things like generics and anonymous functions, anonymous types, pattern matching, null coalescing operators, type inference. Coming from TypeScript, this looked very familiar to me and something I was interested in learning more about.
Proof of Concept
It was time to get going. This is our office. In the early days, it was just the two of us, my co-founder Andy and I. This is in Edgware Road. I call it an office. It’s more of a corridor. That’s actually what it is, a corridor with a door on it. If I stretched out my hands, I could touch both walls at the same time. It’s time to get going, and I’m going to be solo engineering this. You’re going to start with a proof of concept. I searched, how do I get .NET? I downloaded it. In the early days, I was using my ThinkPad, running Arch Linux. No one tells you that when you’re CTO, you’re also going to be responsible for security and compliance and IT for the commercial teams. That’s why we use Macs today. You download this thing, and you get this command line interface. I started it.
This command line interface gives me a really nice way to scaffold a new application. I run it, and it says, hello world. So far, so good. This command line interface does a lot more for me as well. It can manage my dependencies. It can build and run my code. It can test my code. It can format my code. You probably are aware of several stacks where you have a similar experience, where a command line interface does all this for you. I think I was surprised by how far .NET takes this. You can also use this to manage database migrations. You can use this to create a self-signed TLS certificate and put it in your machine’s trust store. In terms of getting started, giving an engineer a machine and saying, please get started, run this local HTTPS service, this really is quick. They’ve optimized for it. You can even go so far as to sign JSON web tokens if you want to.
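To give a flavor of those commands (project and package names here are illustrative, and dotnet ef needs the separate dotnet-ef tool installed):

```shell
dotnet new console -o Tracebit && cd Tracebit   # scaffold a new application
dotnet run                                      # prints: Hello, World!

dotnet add package AWSSDK.S3                    # manage dependencies
dotnet test                                     # run the tests
dotnet format                                   # format the code

dotnet ef migrations add Initial                # manage database migrations
dotnet dev-certs https --trust                  # trusted self-signed TLS cert
dotnet user-jwts create                         # even sign JSON web tokens
```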
If I think back to the project I wanted to build, I tried to break it up into logical chunks, numbered 1 to 4. Firstly, I need to go and see what existing resources you have in your account so I can try and match them. I need to discover that everything’s called jupiter-something. I’m calling that inventory. Then, secondly, I need to actually create these canary resources. Thirdly, I need to monitor the telemetry that’s coming out of this environment to detect when canaries are accessed.
Finally, I need to send notifications to your security team when canaries are accessed. You’re a startup. You’re resource constrained. There’s one of you. You need to prioritize, and you need to be productive, spend time on what matters. If you think about this, steps 1 and 2 don’t really matter at this point. We just need to get going, and I can very easily look at your list of S3 buckets and come up myself with a sensible name. I can give you CloudFormation, or Terraform, or just ask you to go and create one. What I can’t do manually is read a load of logs and then tell you on Slack when one of them triggers. This is how I decided to start. We’re interested in more than just the hello world example. How did I start? What were my first few lines of code? I’m going to call this a modular monolith approach. I decided I had some logical services, which were detection and notification. I just said, there are these two services.
Then I defined a little command line interface. This is taking a list of flags. Each one is a service, so that can be notify or detect. Or if I don’t pass any, I get all of them. Based on the input that I receive there, I add this service. When I was first starting with C#, this looked a bit unfamiliar. What this is, is basically a built-in library called Generic Host. It’s a way of nicely orchestrating the services that you require in your application, including things like gracefully shutting down when you get interrupted, and so on. It’s not a lot of overhead, but with those few lines of code, what comes out is a nice little command line interface where I can pass either detect or notify. What does that give me? Within my IDE, it means I can run logical services very easily. This is just the same thing. I haven’t split up the packages artificially.
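A sketch of roughly what that wiring looks like, assuming DetectService and NotifyService are the hosted services (NotifyService is sketched further below):

```csharp
using System.Linq;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// Each flag names a logical service; no flags means run all of them.
var services = args.Length > 0 ? args : new[] { "detect", "notify" };

if (services.Contains("detect"))
    builder.Services.AddHostedService<DetectService>();
if (services.Contains("notify"))
    builder.Services.AddHostedService<NotifyService>();

// Generic Host handles startup ordering, graceful shutdown on interrupt, etc.
await builder.Build().RunAsync();
```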
Remember, there’s only one of me. I can run detect if I’m interested in working on detect. I can run notify if I’m interested in working on notify. Or if I’m not really interested in either of that, I just want to run the application, I just press play, and everything I need is running. This is still the experience our engineers have today. They get onboarded, they press play, and away it goes. When you think about deploying this into production where we use ECS, this also has benefits in terms of getting started quickly and iterating quickly. Because, essentially, all I have to do is build one Docker image.
Then it’s up to me when I’m deploying this into production whether I really need to run these as separate processes or whether I’m happy for it to run all of the responsibilities in one process, which maybe is not perfect practice, but it’s very pragmatic. What it means is when we’re iterating quickly, we can create a new service for a day. Maybe I just wanted to do an experimental service. Maybe I want to refactor or change the responsibilities of certain services. That kind of constraint hasn’t propagated into my infrastructure layer, so it’s very quick to iterate.
What would one of these services look like? Here’s an example of the NotifyService. All it really is, is a long-running loop that waits for alerts from an alert queue. When it gets some, it sends them to Slack. Then it acknowledges that it sent them, so the queue shouldn’t return them anymore. These few lines of code, that’s how simple it is to get started. What have I said I want to do? I want to be quick to get started, which this is. I want to be quick to get feedback. I want to be quick to iterate.
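A minimal sketch of such a service; IAlertQueue and ISlackClient here are hypothetical stand-ins for the real queue and Slack integrations:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public record Alert(string Message);

public interface IAlertQueue
{
    Task<IReadOnlyList<Alert>> ReceiveAsync(CancellationToken ct);
    Task AcknowledgeAsync(IReadOnlyList<Alert> alerts, CancellationToken ct);
}

public interface ISlackClient
{
    Task SendAsync(Alert alert, CancellationToken ct);
}

public class NotifyService(IAlertQueue queue, ISlackClient slack, ILogger<NotifyService> logger)
    : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Wait for alerts from the alert queue...
            var alerts = await queue.ReceiveAsync(stoppingToken);

            // ...send them to Slack...
            foreach (var alert in alerts)
                await slack.SendAsync(alert, stoppingToken);

            // ...then acknowledge, so the queue won't return them anymore.
            await queue.AcknowledgeAsync(alerts, stoppingToken);
            logger.LogInformation("Sent {AlertCount} alerts", alerts.Count);
        }
    }
}
```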
On the iteration front, probably I’m not going to have chosen sensible defaults for much of this behavior. I’m going to want some notion of configurability. I can define a well-typed object, which defines exactly what I expect in terms of options. Here I have a batch size of 10 by default and a send timeout of 5 seconds by default. I say, I expect these options. Then because I’m doing this strange modular monolith thing, I’ve namespaced them under this notify.
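A sketch of what that might look like, continuing the host setup from earlier (the section name and defaults mirror the description above):

```csharp
// In the host setup: bind NotifyOptions from the "Notify" configuration section.
builder.Services.AddOptions<NotifyOptions>()
    .BindConfiguration("Notify");

// Environment variables reach it too, with "__" as the section separator:
//   Notify__BatchSize=100
//   Notify__SendTimeout=00:00:30

public class NotifyOptions
{
    public int BatchSize { get; set; } = 10;                             // default 10
    public TimeSpan SendTimeout { get; set; } = TimeSpan.FromSeconds(5); // default 5s
}
```

A consumer then asks for IOptions<NotifyOptions> in its constructor and gets the typed, converted values.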
If you look at this BindConfiguration, I’ve said, these options are going to be namespaced. I prefer to configure my applications via environment variables, so this means I can pass in these environment variables. When I realize in production that actually I need to massively increase the send timeout, it’s very easy to do. This will be type-safely converted for me, which, if I compare it to doing this in Python or TypeScript, would have taken as many lines of code and more manual plumbing to achieve the same result. I also want to be quick to get feedback, so, logging: a great source of feedback. I just ask for a logger: NotifyService, I need a logger. I get a logger. I get the best kind of logging, structured logging.
All of these logs are tagged with the service that requested them. I don’t need to go away and do any plumbing like that. Observability is about more than just logging, so I want metrics as well. I ask for this thing. I say, I want a meterFactory. What I’m going to do is record a count of all the alerts I’ve sent. Every time I send an alert, I increment this count. What is this, like five lines of code? I get metrics. What do metrics look like? There’s this command line tooling again, dotnet. It’s a familiar command line tool. I can ask it, I’ve got this process running called Tracebit. Can you give me the counters that pertain to the NotifyService? Then it’s going to update me on how many alerts I’ve sent.
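Those few lines, roughly (meter and counter names are illustrative):

```csharp
using System.Diagnostics.Metrics;

public class NotifyMetrics
{
    private readonly Counter<long> _alertsSent;

    public NotifyMetrics(IMeterFactory meterFactory)   // just ask for it
    {
        var meter = meterFactory.Create("Tracebit.Notify");
        _alertsSent = meter.CreateCounter<long>("notify.alerts.sent");
    }

    public void RecordAlertSent() => _alertsSent.Add(1);   // on every send
}

// Then, against the running process, with the dotnet-counters tool:
//   dotnet-counters monitor --name Tracebit --counters Tracebit.Notify
```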
This is really interesting. This isn’t exactly what you would want from production. This isn’t what observability looks like in production. What it looks like in production is this. .NET has built-in support for OpenTelemetry. We spun up the AWS Distro for OpenTelemetry, which collects them and converts them to CloudWatch metrics. Very easy to get started. We spun up Grafana, and I got good feedback from my production system. What I think is really unique about .NET, at least among stacks I’m familiar with, is just how much the ecosystem seems to have converged on these particular solutions, which I think unlocks a lot of capability for very easily achieving the goals you want.
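A sketch of that setup, assuming the usual OpenTelemetry packages (OpenTelemetry.Extensions.Hosting, the ASP.NET Core and HttpClient instrumentation packages, OpenTelemetry.Instrumentation.AWS, and Npgsql.OpenTelemetry), exporting over OTLP to a collector such as the ADOT one:

```csharp
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddMeter("Tracebit.Notify")        // the custom meter from earlier
        .AddOtlpExporter())                 // e.g. to the ADOT collector
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()     // application server spans
        .AddHttpClientInstrumentation()     // outbound HTTP calls
        .AddAWSInstrumentation()            // AWS API calls
        .AddNpgsql()                        // database queries
        .AddOtlpExporter());
```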
In this example, I’m adding OpenTelemetry. Then I’m adding metrics, but also traces, which are an important component and are built into .NET: traces for my application server, for how long it takes to reach AWS APIs, for database queries, for HTTP client calls. I get all of that for a few lines of code. I get a much greater sense of feedback than I would have in other stacks. I think that’s because Microsoft have provided these basic concepts within the platform, and everyone has converged around those. People will feel different ways about that, but it’s definitely notable. I’ve called out here that there’s a .NET design point: industry standards like OpenTelemetry and gRPC are favored over bespoke solutions, which surprised me. That’s not what I would have expected going into .NET.
MVP – Minimum Viable Product
You start building this product. If you’re in a B2B context like we were, if you’re building a startup, what I would highly recommend is that you go and find some design partners. These are forward-thinking companies who are looking for a solution in the space you’re offering. You basically ask them, will you try my product for free in exchange for feedback? Feedback is a very important component of building a startup. We started to find some design partners who would run this early product. We still have more to build. I think the next notable thing about .NET is persistence, and in particular, Entity Framework as an ORM. This really surprised and impressed me. I’m going to jump into a more familiar domain model than my own product. Let’s imagine we’re modeling something that should be familiar. We have conferences, and conferences have talks. I can get started very quickly by just modeling some simple types, like a conference: it has a name, it has a start date, it has a list of talks.
A talk has a title, a speaker, and a reference to the conference. With these two bits, that’s all I really need to get started with persistence into Postgres. I do just need to define this context piece. This is like a handle to the database. It says, the database is going to have some conferences, and it’s going to have some talks. Under the hood, this basically refers to tables in my database. This is called a code-first approach. Just writing these four or five lines of code is enough now for .NET or Entity Framework to scaffold my database and create a migration that creates those tables, which is convenient. It’s a nice way to get started.
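In code, something like this:

```csharp
using System.Collections.Generic;
using Microsoft.EntityFrameworkCore;

public class Conference
{
    public int Id { get; set; }
    public required string Name { get; set; }
    public DateOnly StartDate { get; set; }
    public List<Talk> Talks { get; set; } = new();
}

public class Talk
{
    public int Id { get; set; }
    public required string Title { get; set; }
    public required string Speaker { get; set; }
    public Conference Conference { get; set; } = null!;   // reference back
}

// The "handle to the database": each DbSet maps to a table underneath.
public class ConferenceContext(DbContextOptions<ConferenceContext> options)
    : DbContext(options)
{
    public DbSet<Conference> Conferences => Set<Conference>();
    public DbSet<Talk> Talks => Set<Talk>();
}
```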
I think what’s really surprising is when you take this further. I want to mutate my data model. I want to add another table. I want to add a column, and so on. I can do that very quickly. I can iterate very quickly by just changing those simple plain types in my code, and then asking for a new migration. That’s unlocked by Entity Framework taking a snapshot of the model state each time you do so. Of course, you’re not just migrating your own local database. You have to migrate your production systems, and so you can easily ask it for a .sql script. It’s surprising how far you can go with just migrating the database at application startup time. Maybe not best practice, but in terms of getting started quickly, it really works well.
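The flow, roughly; migration names are illustrative, and app here stands for the built host:

```csharp
// After changing the plain types, ask for a new migration, and for production
// systems get the SQL out as a script (both via the dotnet-ef tool):
//   dotnet ef migrations add AddTalks
//   dotnet ef migrations script

// Pragmatic, if not best practice: apply pending migrations at startup.
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

using (var scope = app.Services.CreateScope())
{
    var db = scope.ServiceProvider.GetRequiredService<ConferenceContext>();
    await db.Database.MigrateAsync();
}
```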
This flow, it may seem simple, but it is important to optimize. When I looked at how many migrations we’ve done in the two years since we started, it’s certainly in the hundreds. Anything that can make that quicker is certainly worth it. What does it look like to actually use it? I can say, give me a handle to my database, and then I can create a new plain object. This is just a plain old object, and I can add it to my database. Here I’m adding a conference, and I can call SaveChanges, and away it will go and do some DML to insert that into the database.
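Both halves of that, sketched; db is the ConferenceContext from above and the values are illustrative:

```csharp
using Microsoft.EntityFrameworkCore;

// Insert: create a plain object, add it, save.
db.Conferences.Add(new Conference
{
    Name = "QCon London",
    StartDate = new DateOnly(2026, 3, 16),
});
await db.SaveChangesAsync();                    // EF issues the INSERT

// Modify: plain operations on plain types, sandwiched by persistence concerns.
var conference = await db.Conferences
    .Include(c => c.Talks)
    .FirstAsync();                              // take one conference

conference.StartDate = conference.StartDate.AddDays(1);
conference.Talks.Add(new Talk { Title = "C# in a Startup", Speaker = "Sam Cox" });

// EF tracked the changes: one UPDATE for the changed column, one INSERT.
await db.SaveChangesAsync();
```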
In terms of modification, I go and take one conference at random out of my database, and I’m changing the start date, and I’m adding a related entity. I’m adding a new talk to it. Remember, this stuff in the middle, these conferences and talks, they’re just plain old objects. There’s no strange proxy behavior going on here, but Entity Framework will track these changes and commit them appropriately to the database. If you look at this as a piece of code, I have some simple logic in the middle using plain operations on plain types, and it’s sandwiched, as a unit of work, between the database concerns, the persistence-layer concerns, which is really convenient for getting started, because it means I can start with simple operations on simple types. Away it goes. It selects these out. It does the appropriate updates, changing only the fields that changed, and inserts a new talk. If I had to give one killer feature for why C# has let us operate so quickly, it would be querying.
This is an example of a query where we basically ask the database for conferences where the name starts with QCon, the year is greater than or equal to 2020. Of such conferences, I want to select the name, and I want to select a sample talk title from that conference. This looks very similar to if you were performing the same filtering and mapping operation on in-memory collections, but this is operating on the database. If you asked me as an engineer who had never seen C# before what this code was doing before I worked with it, I would say this looks like a terribly inefficient idea, because what it looks like it’s doing is pulling back everything from the database, and then doing an in-memory anonymous function to filter them. That would have been my assumption.
Then doing another in-memory operation to map them afterwards. I don’t know of other languages with a feature like this, because what this is, actually, is not an anonymous function but an expression tree: a representation of this filter where the name starts with QCon and the start year is at least 2020. It’s an in-memory representation of that computation, rather than an anonymous function, which means we can defer executing it, and we can translate it into SQL.
This is an example of what would actually run if I ran that query. It does a subselect to find a sample talk title. It’s converted my “name starts with QCon”, that is, an operation on a string type, into the appropriate equivalent in SQL, which is name LIKE 'QCon%'. I’m not sure how optimal this would be, but it’s managed to translate my start date operation into the appropriate SQL. In terms of getting started quickly, this is really beneficial.
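The query, and an approximation of the SQL it becomes:

```csharp
using Microsoft.EntityFrameworkCore;

var results = await db.Conferences
    .Where(c => c.Name.StartsWith("QCon") && c.StartDate.Year >= 2020)
    .Select(c => new
    {
        c.Name,
        SampleTalk = c.Talks.Select(t => t.Title).FirstOrDefault(),
    })
    .ToListAsync();

// Roughly the SQL this becomes (simplified):
//   SELECT c."Name",
//          (SELECT t."Title" FROM "Talks" AS t
//           WHERE t."ConferenceId" = c."Id" LIMIT 1) AS "SampleTalk"
//   FROM "Conferences" AS c
//   WHERE c."Name" LIKE 'QCon%' AND date_part('year', c."StartDate") >= 2020
```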
What does it give me? It means my whole experience is strictly typed, and I have the full benefit of the IDE behind me. When I want to query my database, I’m not worrying about SQL at all. My IDE is telling me, these are the fields that our conference has, and these are the types you can expect them to be. When I mess up and I say, I want to query by something that isn’t actually a property of that entity, it’s going to tell me immediately. End-to-end, my full experience is strictly typed, and that allows me to iterate very quickly.
For instance, I can use my IDE to give me all references to a particular column on a particular table in the database, which, if I was using a query builder or a write-the-SQL-yourself approach, would be much more difficult. I can use my IDE to refactor. We’ve renamed columns and so on successfully with a simple migration and a change to the model, which would have been very painful had we tried to do that any other way. I think it’s really powerful. Here’s another example of quick to iterate. You have all this enthusiasm. You start building your model, and you start building your system.
Then you realize, actually, I have this big cross-cutting concern: it turns out I want a talk to have a status, which is whether it should be visible or not, and in nearly all circumstances I don’t want non-visible talks coming back from my queries. Had I used a query builder approach or written SQL manually, that would have been painful to achieve. This way, it’s really easy.
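This reads like EF Core’s global query filters; a sketch, assuming Talk has gained a Status property:

```csharp
public enum TalkStatus { Hidden, Visible }

// Talk gains: public TalkStatus Status { get; set; }
// Then, inside ConferenceContext, one global query filter covers every query:
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Talk>()
        .HasQueryFilter(t => t.Status == TalkStatus.Visible);
    // Opt out for the rare admin query with .IgnoreQueryFilters().
}
```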
You can definitely shoot yourself in the foot with SQL very easily. Here’s an example where I’ve accidentally queried for a million rows from my database. Because .NET is uniquely converged on a particular set of solutions and tools, it means you have great integration throughout that stack. Here’s my IDE recognizing that I’m querying the database, and it’s warning me, you’re pulling back a million rows, and you’re allocating a lot of memory in the process, which isn’t an experience I’ve had in other stacks, that level of deep integration, not just for static analysis, but dynamic analysis.
Then, of course, best will in the world, you can’t model production on your local machine in all cases. You start looking at things like this. This is a graph of production load on my database, and you want to ask questions of it. Like, why does it spike every hour, and so on? RDS Performance Insights will helpfully tell me the SQL that’s generating the most load. I want to iterate quickly. I’m avoiding premature optimizations. I want to make the most meaningful changes I can here, so I look at the top SQL. Away I go, and I can add this TagWith, because it’s inconvenient to just have the SQL: where in my code is that actually represented? I can tag with a name, which is nifty, because then it will add a comment to the start of the SQL, so I can easily correlate. That’s not very quick to iterate, though, because I need to think of names.
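For example (the query and tag are illustrative):

```csharp
using Microsoft.EntityFrameworkCore;

// Tag the query yourself...
var recent = await db.Conferences
    .Where(c => c.StartDate.Year >= 2020)
    .TagWith("Dashboard: recent conferences")
    .ToListAsync();

// ...and the tag arrives as a leading comment on the generated SQL:
//   -- Dashboard: recent conferences
//   SELECT ... FROM "Conferences" AS c WHERE ...

// Or skip inventing names and record the calling method and file instead:
var recentByCallSite = await db.Conferences
    .Where(c => c.StartDate.Year >= 2020)
    .TagWithCallSite()
    .ToListAsync();
```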
Instead, I can just TagWithCallSite, and the calling method and its location will be put there. It’s really easy to iterate. Even that was too much for us, so we added an interceptor which does the same thing automatically. It has some runtime overhead, but we want to be quick to iterate. Overall, what’s the experience of using all that like? It’s very quick to get started, because you can model your data with plain types. You can model your operations, your units of work, with plain methods or functions. Then you can easily adapt that to a real interaction with a database and the persistence layer.
Using techniques like this, I basically built the end-to-end system. The thing no one really tells you, when you’re trying to get people to use your product, and you want to give them demos, and you want them to get excited about it, and you want them to be design partners, is that you need a frontend. It makes it a lot more compelling. Again, I reached for C#: ASP.NET, a really battle-tested framework for HTTP servers in C#.
Here’s an example of what doing a frontend in C# might look like. It’s a component model, where you intersperse HTML with C#. I found it very convenient. In terms of getting started, it’s batteries included all the way. We wanted an authenticated system. This is a B2B product. I just added AWS Cognito, which is an identity provider within AWS. I hooked that up to .NET, and I get easy single sign-on support for my customers. That’s what they’re concerned about. I also don’t have to worry about the headache of handling any credentials myself. I want to iterate quickly, but in a secure way. I can offload that responsibility. This is Blazor, and it has some interesting models. You can build your frontend in C#. You can compile that to WebAssembly. You can ship that WebAssembly to the client.
Then the client will maintain the state of the DOM effectively, and can interact with your server to get updates and so on. That’s neat and distinctive. Or there’s an alternative where you ship some JavaScript, which establishes a WebSocket with your backend, and now the state is on the server side in memory, and you pass events and DOM updates back and forth over this WebSocket, which is nifty. I don’t think this would have worked for us because we’re quick to iterate. We’re shipping code all the time, and so I don’t want this state to be in memory on the server when I want to be cycling out containers on a regular basis. I went for, let’s get started quickly, the simplest model, the one I grew up with, request-response HTML. Yes, we got started quickly with this. We have some JavaScript for reactivity, but we’ll see if we need to iterate on this.
All together, what does this stack give me? Because it’s in C#, I can share code with the backend, so it’s quick to get started. Because I’ve chosen to render the frontend on the server side, it’s strictly typed: if there are errors in it, I find them in my IDE first. Or, if they make it through to runtime, I get errors in my application at runtime, and I have appropriate monitoring, so I can get feedback quickly. I don’t have to worry about how to propagate errors from the client side because there’s really not much state happening on the client side whatsoever. It’s statically typed, and I have full IDE support, so it’s quick to iterate. I can see references to my entities in the frontend. I can see them in the backend. It’s very quick to iterate. I’m much more of a backend engineer than a frontend engineer, so maybe I’ve got this wrong, but I’ve been happy with the outcome. This is a bit of our frontend. This is a Lighthouse report, and it seems pretty performant and accessible and apparently best practices. Less good on SEO, but this is a private portal, so I’ll let that one slide.
Load Testing
We had some design partners now. People testing the product, people giving us feedback on the product, really valuable feedback. One of them really liked it. They were like, I can actually see how this would provide value to our business, and I’m actually tempted to pay money to run this across my whole AWS organization, but the trouble is we have a very big AWS organization. We have hundreds of AWS accounts. We have terabytes of audit logs, and when we’ve worked with security tools before, we’ve actually broken their systems. The scale has been too large, and those were from much bigger companies than you two in your corridor. What we’re going to do is we’re just going to start adding load to the system to see if it’s even feasible that we could buy this system from you and have it work.
Unfortunately, they are based on the West Coast of America, so when they started doing this, that is when I got my alert at 3 a.m. This felt like an existential moment. It’s like, a company wants to buy from us. This could be the start of a great thing, but the system’s not working as designed, and so I need to, as quickly as possible, get feedback from the system and iterate on it. Thanks to the metrics and monitoring, I at least knew this had happened, and I could start acting on it immediately.
With the best will in the world, whatever metrics and tracing you might have, what I wanted to do immediately was collect stack samples from my application. I wanted to see what it was actually doing. It was clearly not performing, but I wanted to see where it was spending its time. This is the dotnet-trace tool. Here I’m telling it, there’s this process Tracebit, can you go and collect 30 seconds’ worth of samples from it? Output it in Speedscope format. You can also output it in formats that you can use with the Chrome debugger, which can be handy on occasion. Crucially, I could run this when I needed it without having to pass special startup arguments. If I’d been in TypeScript, or Node, or Python, I think I would have had to have anticipated this need and been fiddling around with command arguments to my container. Here it was just ready to go. I was ready to go and collect this information.
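The invocation looks something like this:

```shell
# Attach to the running process by name, sample for 30 seconds, and write a
# Speedscope-format profile; no special startup flags were needed.
dotnet-trace collect --name Tracebit --duration 00:00:00:30 --format Speedscope

# --format Chromium also works, for the Chrome debugger's viewer.
```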
What you get out of it is something like this. It’s going to be, in some cases, a bit hard to interpret, but broadly it’s going to be telling you where you’re spending your time. This was invaluable for me to react quickly enough to save the day here. In some cases, I identified some strangely expensive operations which were actually due to a performance regression in the AWS SDK, which was a nice win, because then I could just upgrade that library and get a bit more performance. Of course, in most cases, it was my code that was at fault. Here’s an example. This is the inventory service, the piece which collects information about all of your resources running in AWS. It doesn’t really pass a sniff test, does it? If you look at this, you’re like, yes, four nested loops. But actually: get started quickly. You don’t have all the time in the world.
This is actually a great way to get prototyping. What it’s doing is basically there’s a lot of AWS accounts in an organization, it’s getting them all. Each one has regions, it’s getting them all. Each region has enabled services like S3 and Dynamo, it’s getting them all. It’s listing each resource in each service, getting them all.
Then, finally, from those identifiers, it’s asking, describe this resource for me and stick it in the database. There’s a number of problems here. It’s wearing the mask of concurrency. It’s got all these awaits, but there is no concurrency here, at least with respect to itself. This is going to be very slow when you have hundreds of accounts and millions of resources. Also, I’m doing a database insert for each one. I’m adding a lot of unnecessary transactions and load to my database because there’s no batching.
What did I do about this, and how did I do it quickly? I went and found this library from Microsoft called Dataflow. Essentially, what I did here is I wrapped those existing methods. I didn’t adapt any of them. I wrapped them in these constructs called blocks, which basically represent chunks of computation that you could perform. The first three are one-to-many: they’re TransformManyBlocks. The first one takes an account, and it returns many regions.
The second one takes a region and it returns many services, and so on. The middle one, TransformBlock, is one-to-one. It takes a resource ID, and it goes and gets details about it. Then database insert is one-to-zero. It just performs an action, and it doesn’t need to return anything. I haven’t had to change the underlying implementations. I’ve just changed how this code is structured, and I’ve wrapped them in these blocks. These represent nodes of computation, and this library, Dataflow, allows me to connect them in a graph. Here I link them into, once you’ve got the regions for an account, you should send them to the thing which can get services for a region, and so on.
Until you get to the last line, where we’re inserting resource details into the database. I also tell it, you should propagate completion, which basically means: when your preceding block has completed and you’ve finished your own work, you should declare yourself completed, so the next one in line can do the same, and we know when the whole pipeline has completed. On its own, this just sits there doing nothing. If I have this pipeline of computation, this graph representing the computation, I need to seed it with some stuff, which is the first step, the accounts. I list all the accounts, then I send them into the pipeline, and I say, I’m done, you shouldn’t expect any more accounts. Finally, all I have to do here is just wait until the last step in the pipeline, the insertion, has completed.
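Put together, the pipeline looks roughly like this; the entity types and the ListXxxAsync/DescribeAsync/InsertAsync methods stand in for the pre-existing ones, assumed here to return tasks of enumerables:

```csharp
using System.Threading.Tasks.Dataflow;   // NuGet: System.Threading.Tasks.Dataflow

// Wrap the existing methods in blocks. One-to-many steps are TransformManyBlocks,
// the describe step is a one-to-one TransformBlock, the insert an ActionBlock.
var getRegions   = new TransformManyBlock<Account, Region>(a => ListRegionsAsync(a));
var getServices  = new TransformManyBlock<Region, ServiceName>(r => ListServicesAsync(r));
var getResources = new TransformManyBlock<ServiceName, ResourceId>(s => ListResourcesAsync(s));
var describe     = new TransformBlock<ResourceId, ResourceDetail>(id => DescribeAsync(id));
var insert       = new ActionBlock<ResourceDetail>(d => InsertAsync(d));

// Link the blocks into a graph, propagating completion down the line.
var link = new DataflowLinkOptions { PropagateCompletion = true };
getRegions.LinkTo(getServices, link);
getServices.LinkTo(getResources, link);
getResources.LinkTo(describe, link);
describe.LinkTo(insert, link);

// Seed the pipeline with accounts, then signal that no more are coming.
foreach (var account in await ListAccountsAsync())
    await getRegions.SendAsync(account);
getRegions.Complete();

// Wait for the final block to drain.
await insert.Completion;
```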
I look at all this code together, and what’s really changed? It’s at least twice as long. It’s way more abstract and confusing. It passes the sniff test a bit better, because it’s harder to see exactly what’s going on, but I haven’t really changed very much here, and therefore it does exactly the same thing. It’s just as unperformant as before. The only thing that’s really changed is it’s gone from the appearance of nested loops to something that looks more like a pipeline. It feels more approachable. What it really does is let us iterate very quickly to solve the actual problem at hand.
The first problem I identified was that this was going to result in millions of small transactions on the database, which is not good for performance. A sensible thing to do there would be to batch them up. Without changing any of the underlying methods that perform these actions like list regions or database insert, I can insert another block here. This is a batching block, which will collect a group of 10 things to be inserted, and then emit those as a batch. I can just thread this into the appropriate place within this pipeline, and then I’ve magically reduced the number of transactions I’m performing on my database by tenfold, which is a really good start. It reduces load on the database, but I still have this same underlying problem. I still have a problem that it’s slow. It’s not really concurrent. I can solve that very easily, so I can pass these options to these blocks, which represent computation.
I can say, instead of just running one at a time, you can have up to five running in parallel or concurrently. The trouble is, if you were to just say, go wild, go as concurrent as you want, you have some problems. We’re interacting with the AWS APIs, and they’re going to rate limit us, so you want some bounded concurrency. Also, depending on where the bottleneck in this pipeline is, if it turns out the bottleneck is inserting into the database, I don’t want the preceding steps to go wild and start putting all this stuff in memory. So I say, only allow a queue of up to 100 things before you stop accepting new ones, and then the thing that’s sending them to you has to wait.
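Continuing the same sketch, the batching block slots in between describe and insert (InsertManyAsync is an assumed batched version of the insert), and the options bound the expensive steps:

```csharp
// Batch inserts: collect ten details, emit them as one array, one transaction.
var batchOfTen  = new BatchBlock<ResourceDetail>(10);
var insertBatch = new ActionBlock<ResourceDetail[]>(batch => InsertManyAsync(batch));

describe.LinkTo(batchOfTen, link);
batchOfTen.LinkTo(insertBatch, link);

// Bounded concurrency plus backpressure for a block:
var describeOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 5,   // up to five describe calls in flight
    BoundedCapacity = 100,        // senders wait once 100 items are queued
};
var boundedDescribe = new TransformBlock<ResourceId, ResourceDetail>(
    id => DescribeAsync(id), describeOptions);
```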
By doing this, I get a system that actually looks quite sophisticated. This looks much more like what you would build if you were building this out as architecture, maybe based on a message bus or maybe serverless functions with concurrency between them and even backpressure, which not all such systems would have. You get this in 20 or 30 lines of code, and crucially, this hasn’t propagated to my infrastructure layer. If I need to change this, it’s very easy. It’s a few lines of code. Because it’s probably not appropriate that the concurrency of database writes is the same as concurrency of asking AWS for services, if I need to tweak it, it’s very simple. It’s just in memory. It’s in code. I could one day scale this out to a true horizontally scalable system based on exactly the same principles, but we want to be quick to start, quick to iterate, quick to get feedback. It wasn’t necessary, so I did this, and essentially this saved the day.
I was very grateful at the time that there was a battle-tested, well-documented library out there from Microsoft that I could trust to help me solve this problem. It’s worth saying I was also surprised how few problems I actually needed to solve from a performance perspective. This is TechEmpower’s benchmarks, and .NET, the standard stack, the one that you would just pick up off the shelf, actually performs amazingly well. This isn’t like going and looking for hyper-optimized niche solutions. This is just what you would use. I think it’s because the .NET team have spent an enormous amount of time optimizing performance. I enjoy reading this every year. This is a 408-page blog post about performance improvements in .NET. I don’t have to think about this most of the time, but I accrue the benefits from it.
Expand
Off the back of that, this prospect was happy. They enjoyed the product, and they were confident that they could scale out to their whole organization. They were willing to become a customer. They were our first customer. They paid us money for this. Off the back of that success, we went and raised a round of investment, we raised a seed round, which has really unlocked everything since then because it’s allowed us to go from one engineer, me in the corridor, to hiring six engineers, all of whom had no prior experience of C#, all of whom shipped to production on their first day. We’ve made thousands of commits and releases since then with no major regressions, thanks in part to the tooling and the techniques that I’ve demonstrated here. We’ve kept the velocity. We’ve gone from one platform supported, AWS, to many platforms supported using this canaries approach.
The velocity has scaled with the number of engineers. I know these are early days, and that won’t continue forever, but it’s been beneficial. We’ve gone from one customer to many more. Now our customers include Docker, Synthesia, Riot Games. Based on .NET, this has been powerful. If you are thinking about starting a startup, I would encourage it. It’s hard work, but it’s a lot of fun. It doesn’t look like this. You’re not building a perfect vision from a perfect specification you know ahead of time. You don’t have months for groundwork and laying the foundations. It also shouldn’t look like this. This is my son’s favorite book. You’re not just haphazardly stacking things on top of one another that don’t really work too well together. The model I would think of is more like this. This is the Ship of Theseus, which is a thought experiment.
If you take a ship, and you replace it plank by plank, component by component, in a piecemeal fashion with new parts, does it ever stop being the original ship? Not really. I would argue, you start with the balsa wood, very easy to construct, but you don’t need to go all the way to steel. You see what actually fails. You get feedback, and you see what you need to replace, because that’s the most efficient way to do things. Because at the end of the day, when you’re starting a new startup, something that’s never been done before, maybe no one wants the ship you’re building. Maybe they don’t want a ship at all. If productivity is spending time on what matters, what matters is figuring out whether anyone wants the ship. I would encourage you to try it, but use tools which let you quickly get started, quickly get feedback, and quickly iterate.