Key Takeaways
- Modern embedded systems must reconcile increasing software complexity with stagnating memory limits, pushing developers to adopt languages like C++ while optimizing for binary size due to stringent hardware constraints.
- C++ offers zero-cost abstractions that allow high-level programming without runtime performance penalties, but developers must remain aware of how language features like templates, smart pointers, and STL usage affect binary size.
- Tools such as Bloaty and Puncover are essential for understanding and managing binary bloat, providing insight into which components and design patterns contribute most to firmware size.
- Trade-offs between runtime efficiency and binary size should influence architecture decisions, such as preferring concepts over polymorphism or using simpler standard library alternatives like <cstdio> instead of <iostream>.
- Binary size optimization is a full lifecycle concern, best addressed by integrating size tracking into CI pipelines and making conscious decisions around language features, toolchain flags, and design scalability.
When thinking about the kinds of products software developers work on, we mostly think about web services, desktop applications, or high-performance computing, such as training an AI model on a cluster of servers.
When I think about software development, I generally look at the circuit boards, sensors, and LEDs on my desk. These “tiny” gadgets are usually called embedded devices. Although single-board computers, such as the Raspberry Pi, are also referred to as embedded Linux systems, we are going to focus on microcontrollers.
This article investigates the constraints faced when writing software for microcontrollers, the current landscape of C++ development, and how to tackle one of the big challenges when building for scale and complexity: binary size.
What are microcontrollers?
This class of chips does not run fully-fledged operating systems such as Windows or Linux. Usually, they run more lightweight real-time operating systems (RTOS). Sometimes the requirements are so harsh (or the computing power so limited) that applications run on bare metal, with direct access to the interrupts and registers of the processor.
In many product segments, microcontrollers are employed as the fundamental processing unit in the system, where efficient and low-power computing is required, e.g., environmental monitoring, industrial sensors, and home automation. The term Internet of Things (IoT) is commonly used to define these use cases that rely on a network of tiny sensors and edge computing nodes.
In comparison with conventional processors, microcontrollers have a greater diversity of hardware architectures.
This fragmentation is also reflected on the software side, where we see a greater reliance on proprietary tooling and software development kits (SDKs) to accommodate the low-level hardware access for each chip and its architecture. Because of this close relationship with the hardware, microcontroller SDKs have almost exclusively been written in lower-level languages, especially C.
Frameworks and Languages
Building upon SDKs, we are already able to develop full applications, and even directly access hardware intrinsics such as interrupts and registers. However, being so close to the hardware makes the effort to develop complex programs and reuse them across multiple platforms significantly higher. Because of this increased effort, we usually build applications for microcontrollers on top of an RTOS. These frameworks provide an abstraction layer to manage multiple threads, configure their priorities, and appropriately schedule their execution. Notable examples in this space are FreeRTOS and Zephyr.
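To make this concrete, here is a minimal sketch of what an RTOS-based application can look like, using the FreeRTOS task API; the task bodies, names, and priorities are illustrative assumptions, not taken from a particular project:

#include "FreeRTOS.h"
#include "task.h"

// Hypothetical task that blinks a status LED every 500 ms.
void blink_task(void*) {
    for (;;) {
        // toggle_led();  // hypothetical board support call
        vTaskDelay(pdMS_TO_TICKS(500));
    }
}

// Hypothetical task that samples a sensor more often, at a higher priority.
void sensor_task(void*) {
    for (;;) {
        // read_sensor();  // hypothetical board support call
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}

int main() {
    xTaskCreate(blink_task, "blink", configMINIMAL_STACK_SIZE, nullptr, 1, nullptr);
    xTaskCreate(sensor_task, "sensor", configMINIMAL_STACK_SIZE, nullptr, 2, nullptr);
    vTaskStartScheduler();  // hand control to the scheduler; this call does not return
}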
To keep up with the evolution of both software and product requirements, embedded development needs to leverage the expressiveness and ease of use of higher-level languages. Programming languages such as C++ and Rust are now also being used in the embedded space, which brings a whole new set of challenges to tackle, because these languages were primarily developed without such harsh restrictions in terms of memory and computing power.
C++ in microcontrollers
C++ is a general-purpose programming language that has been in development for the last 40 years. Updates to the language are introduced by the standardization committee in the form of new versions. More recently, the committee has followed a three-year cadence, with C++23 being the latest revision. C++ features are generally divided into two major categories: language features and library features. Language features define the core functionality and syntax of your code, such as the meaning of keywords and mathematical operators.
Library features introduce additional utilities, generally in the form of objects or functions, which are written with the core language. Commonly used library features include strings, data containers, and algorithms. While language features are implemented intrinsically by the compiler, library features are available in the form of a library that you link to your program, called the standard library.
Developers of microcontroller projects usually avoid the standard library, because several of its features, such as data containers, use dynamic memory allocation. As previously discussed, microcontroller chips have very limited volatile memory (RAM) and usually run applications on bare metal or on a simple RTOS, without a memory management unit (MMU). Because of this memory limitation, constant allocation and deallocation over the application’s lifetime can generate fragmentation. Memory fragmentation is the scattering of a program’s available memory through repeated allocations and partial deallocations. After a certain time, the program is unable to allocate new contiguous memory blocks, even though the total amount of free memory would suggest otherwise. The process is visualized in the picture below:
When it fails to allocate memory, the program will either throw an std::bad_alloc exception or terminate. In the context of a microcontroller firmware that doesn’t support exceptions, this means each firmware run will steadily consume usable memory until the device inevitably crashes and reboots.
To avoid dealing with dynamic allocations, a popular alternative is the Embedded Template Library (ETL), where all objects have statically allocated memory. In this way, the developer has a very deterministic view of how much memory will be used at runtime.
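As a minimal sketch, a fixed-capacity ETL container could be used like this (the element type and capacity here are illustrative):

#include "etl/vector.h"

// Capacity is a template parameter, so all storage is reserved statically
// at compile time; there is no heap allocation and thus no fragmentation.
etl::vector<int, 8> samples;

void add_sample(int value) {
    if (!samples.full()) {
        samples.push_back(value);
    }
}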
Having said that, using the standard template library (STL) also has its benefits. With the STL you get access to the latest library features developed for C++ and their synergies with new language features. The entry barrier for developers is lower, because they have more resources to learn from and rely on. And as more objects and contexts become available for compile-time evaluation through the constexpr and consteval qualifiers, we can perform more logic at compile time, reducing the need for allocations at run time.
More importantly, the dynamic allocation disadvantage of the standard library that I mentioned earlier might actually not matter, depending on the domain you are working in and how you design your application. Say we have an application that can be fully characterized at boot time, such as a single-function device like an appliance. Another example is a system with separate execution and configuration modes, where a complete reset or restart of the application is required between them. In such cases, you can fully define and allocate your objects during initialization and then have a stable memory footprint for the rest of the application’s run time. I would even go further and relax this constraint if you have a stable allocation and deallocation cycle, or if cycles from different tasks don’t overlap in time, so that no irregular allocation patterns arise to generate fragmentation.
Therefore, we should not rule out the standard library from the get-go. Analyze your application domain and execution lifecycle, and then assess whether it is suitable in terms of memory footprint. It is also notable that the standard library keeps gaining more embedded-friendly features. For instance, C++26 will introduce a fixed-capacity container, std::inplace_vector, where the user can try to push back an element and the container will check whether there is still space left to perform the operation. Hopefully, we will see more of such relevant features in upcoming standards.
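A sketch of what that could look like, based on the std::inplace_vector proposal (P0843); compiler support is still emerging, so treat this as illustrative:

#include <inplace_vector>

// Fixed capacity of 4 elements, stored in-place: no heap allocation.
std::inplace_vector<int, 4> readings;

bool add_reading(int value) {
    // try_push_back returns a pointer to the new element,
    // or nullptr when the container is already full.
    return readings.try_push_back(value) != nullptr;
}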
C++ evolution
C++ has seen an impressive number of language and library features added to the standard in the last 20 years. There is a lot to say about how this has affected software architecture and development for microcontrollers.
The two main effects I want to highlight are that we can write C++ code at a higher abstraction level than before, thanks to smart pointers and auto type deduction, and that we have more powerful libraries, such as format and print, that save a lot of development effort around memory management, the type system, and so on. Additionally, the language and its libraries have introduced more powerful syntax for performing complex operations, such as variadic templates, fold expressions, and ranges.
Let’s combine some of these aspects in a small example and see them in practice. We have a system that accumulates data points of different types. We can generalize the data point implementation in a templated structure, where further metadata can also be stored:
template <typename T>
struct data_point {
    T value;
};
The specification is to have a common interface, through which a series of values of the same type are processed and then sent to a certain peripheral. In the end, our client code will look like this:
int main() {
    // 1. process value based on integer or floating point
    // 2a. send integers through peripheral a
    // 2b. send floating points through peripheral b
    // 3. return std::vector<data_point<T>>
    auto ints = process_and_send(16, 32, 48, 64);
    auto floats = process_and_send(1.f, 2.f, 3.f, 4.f);
    auto doubles = process_and_send(5.0, 6.0, 7.0, 8.0);
}
As we can see, the client code is fairly high-level. The user doesn’t need to worry about explicitly indicating the type to be processed, even though C++ is a statically typed language. The code underneath takes care of resolving the correct use case.
First, let’s think about the peripherals. We need to define an interface for the serial driver, through which we want to send a sequence of characters and get back a boolean indicating whether the operation was successful. Instead of abstract classes and polymorphism, we use concepts to define the constraints a type needs to satisfy to be a serial driver:
template <typename T>
concept peripheral_like = requires(T drv, std::string const& str) {
    { drv.send(str) } -> std::convertible_to<bool>;
};
From this concept, we can write classes for each peripheral that satisfy these constraints. To get the desired peripheral in each instance, we have a function that resolves at compile time the wanted peripheral based on an id number that enumerates all peripherals in the system, and then returns a shared smart pointer. We omit the implementation details to keep the focus on the overall architecture of the code:
class peripheral_a { ... };
static_assert(peripheral_like<peripheral_a>);
class peripheral_b { ... };
static_assert(peripheral_like<peripheral_b>);
template <unsigned int id>
auto get_peripheral() { ... } // std::shared_ptr of peripheral_like type or nullptr
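For illustration only, one hypothetical way get_peripheral could be implemented is an if constexpr dispatch over the id; the id constants below are assumptions, not part of the original code:

#include <memory>

inline constexpr unsigned int periph_a_id = 0;
inline constexpr unsigned int periph_b_id = 1;

template <unsigned int id>
auto get_peripheral() {
    // Only the branch matching `id` is instantiated, so each instantiation
    // returns a shared pointer to the concrete peripheral type.
    if constexpr (id == periph_a_id) {
        return std::make_shared<peripheral_a>();
    } else if constexpr (id == periph_b_id) {
        return std::make_shared<peripheral_b>();
    } else {
        static_assert(id <= periph_b_id, "unknown peripheral id");
    }
}

A real implementation would likely return a cached singleton per peripheral rather than allocating on every call.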
Now we turn our attention to processing the values we want to send. For that, we again use concepts, the C++20 feature that provides a simpler syntax for defining requirements on types in template metaprogramming. In this example, we overload the function process_and_send_dp by attaching implementations to different concepts that constrain the type of the incoming data. In each implementation, the data point is post-processed; then we acquire the corresponding peripheral and send the data under a specific formatting scheme.
template <typename T>
concept is_integer = std::numeric_limits<T>::is_integer;

template <is_integer T>
constexpr data_point<T> process_and_send_dp(T &v) {
    v += 1;
    auto drv = get_peripheral<periph_a_id>();
    drv->send(std::format("0x{:04x}", v));
    return {v};
}

template <std::floating_point T>
constexpr data_point<T> process_and_send_dp(T &v) {
    v *= 2;
    auto drv = get_peripheral<periph_b_id>();
    drv->send(std::format("{:.2f}", v));
    return {v};
}
At last, we need a function that calls process_and_send_dp for every element of the data series and constructs the resulting vector. Again, we can use auto type deduction, now in combination with variadic templates and fold expressions, to process the entire list of input parameters.
auto process_and_send(auto... values) {
    using value_type = typename std::common_type<decltype(values)...>::type;
    std::vector<data_point<value_type>> v;
    (v.push_back(process_and_send_dp(values)), ...);
    return v;
}
The resulting implementation is able to identify the common type among the values being sent and use the corresponding processing, formatting, and peripheral:
int main() {
    auto ints = process_and_send(16, 32, 48, 64);
    // 0x0011 0x0021 0x0031 0x0041 over peripheral a
    auto floats = process_and_send(1.f, 2.f, 3.f, 4.f);
    // 2.00 4.00 6.00 8.00 over peripheral b
    auto doubles = process_and_send(5.0, 6.0, 7.0, 8.0);
    // 10.00 12.00 14.00 16.00 over peripheral b
}
From the example above, we can see that the interface is quite flexible and powerful. For instance, we don’t need extra configuration parameters or different function names for each input type. Intuitively, such flexibility should also mean performance overhead, because the underlying implementation has to resolve the different use cases under the hood. In object-oriented C++, this overhead usually comes in the form of vtables and the indirections needed to resolve which derived class is going to be used. However, this is not necessarily the case, and C++ has many techniques to mitigate this overhead.
Constexpr/consteval contexts and template metaprogramming can shift a lot of logic and type resolution to compile time, especially when combined with move semantics to avoid copies, inlined code to speed up execution, and even precalculated value tables.
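As a small illustration of shifting work to compile time, consider a precalculated lookup table; this is a hypothetical example, and note the trade-off that the table itself occupies flash:

#include <array>
#include <cstdio>

// The table is computed entirely at compile time and can be placed in
// read-only memory; no computation happens at run time.
constexpr auto make_square_table() {
    std::array<unsigned, 256> table{};
    for (unsigned i = 0; i < table.size(); ++i) {
        table[i] = i * i;
    }
    return table;
}

constexpr auto square_table = make_square_table();

int main() {
    std::printf("%u\n", square_table[12]); // prints 144
}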
By choosing the right function to be called at compile time, we don’t pay any performance penalty for writing the code in this manner. A common term for these techniques, where we can write more complex functionality with minimal or no performance overhead, is zero-cost abstraction. We can write client code in a less specialized way with no downside. Even though that sounds like a perfect world, there might be trade-offs in areas besides code complexity and run time that we are not accounting for. In C++ development, some features and designs can have a great impact on binary size, and binary size is a particularly important criterion for embedded applications due to their hardware and cost constraints. Before continuing to look at such features, let’s try to understand why binary size is such a critical constraint in certain types of hardware.
Hardware Perspective
Although we are focusing on software here, it is important to say that software does not run in a vacuum. Understanding the hardware our programs run on, and even how hardware is developed, can offer important insights into how to tackle programming challenges.
In the software world, we have a more iterative process: new features and fixes can usually be incorporated later, for example in the form of over-the-air updates. That is not the case with hardware. Design errors and faults in hardware can, at the very best, be mitigated with considerable performance penalties. Such errors can introduce vulnerabilities like Meltdown and Spectre, or render the whole device unusable. Therefore, the hardware design phase has a much longer and more rigorous process before release than the software design phase. This rigorous process also impacts design decisions in terms of optimizations and computational power: once you define a layout and bill of materials for your device, the expectation is to keep them constant in production for as long as possible in order to reduce costs.
Embedded hardware platforms are designed to be very cost-effective. Designing a product whose specifications, such as memory or I/O count, go partly unused also means a cost increase in an industry where every cent in the bill of materials matters.
One important specification is non-volatile (NV) memory, such as flash memory, which is used to store the firmware and the data needed by the application. NV memory has a great impact on the die size of microcontrollers, so chip designers will include the minimum amount needed to run an application. Looking at major vendors, we still see most microcontrollers offering less than 1 megabyte of internal flash memory; even the more modern and capable chips do not exceed 2-4 megabytes. At the same time, evolutions in architecture and node technology have made the compute capability of these tiny chips much higher, and thus capable of running more complex applications. From these contrasting points, we can see that one of the greatest challenges in embedded development is fitting larger applications into a still limited memory footprint, which makes analyzing the binary size of the developed firmware crucial.
Binary Size Analysis
Once you have the compiled binary file of your firmware, you can start analyzing how many bytes each source code component, function, and so on contributes to the overall size. Some freely available, open-source tools for this analysis are Bloaty and Puncover. Bloaty is a command-line tool that profiles and sorts the size of components in your binary at different levels, such as sections, segments, and compile units. The results are displayed on the terminal in the form of lists:
bloaty -d symbols --domain=vm -n 15 dynamic_storage.elf -- fixed_storage.elf
VM SIZE
--------------
[NEW] +84 dynamic_storage::store()
[NEW] +46 std::__shared_count<>::~__shared_count()
[NEW] +44 dynamic_storage::~dynamic_storage()
[NEW] +36 fixed_storage::store()
[NEW] +36 store_and_print()
[NEW] +24 vtable for dynamic_storage
[NEW] +24 vtable for fixed_storage
[NEW] +18 fixed_storage::~fixed_storage()
+100% +8 std::_Sp_counted_ptr_inplace<>::_M_dispose()
[NEW] +4 dynamic_storage::cstr()
[NEW] +4 fixed_storage::cstr()
+9.1% +2 std::__cxx11::basic_string<>::_M_dispose()
-5.6% -2 std::_Sp_counted_ptr_inplace<>::~_Sp_counted_ptr_inplace()
-16.1% -36 experiment()
[DEL] -40 std::_Sp_counted_base<>::_M_release()
+2.5% +252 TOTAL
Puncover, on the other hand, spawns a web server so the results can be viewed graphically. It creates a file-explorer-like application that shows the binary contribution of each source file. Each source file then has its own page, where the user can see the disassembly and a list of symbols ordered by stack, code, or static size.
In a way, these tools complement each other. Bloaty gives a quick, high-level overview of the entire binary, making it faster to identify the parts that contribute most to its size. Puncover can then be used to take a deeper look at the identified components and better understand changes by comparing symbol lists or direct disassembly differences.
I have made a public repository, cpp_binary_size, with different case studies of binary size impact in C++ development. The aforementioned tools can be used to compare the resulting binaries within each case and identify the causes of binary size variations.
Going through the examples, we can see the binary size impact of different aspects of C++ programming. This also means there are different approaches for optimizing binary size during development. Here are some relevant remarks to keep in mind:
Assess the impact at the interfaces, like constructors and function calls. Look for places where unnecessary copies or casts are being made. Moreover, analyze how this impact evolves at scale, e.g., when the number of objects and/or call sites increases. For instance, passing a char* into a function that has std::string or std::string const& in its signature will allocate a temporary string. In another instance, using push_back or emplace_back to add an element to a vector can result in either a copy or a move of that element; copies usually introduce more code (and thus a bigger binary) than moves.
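As a small sketch of the temporary-string case (the function names here are hypothetical):

#include <cstdio>
#include <string>
#include <string_view>

// Interface taking std::string by const reference: a char* argument
// implicitly constructs a temporary std::string at every call site.
void log_line(std::string const& s) { std::printf("%s\n", s.c_str()); }

// Alternative taking std::string_view: no temporary string is created.
void log_line_view(std::string_view s) { std::printf("%.*s\n", static_cast<int>(s.size()), s.data()); }

int main() {
    const char* msg = "sensor ready";
    log_line(msg);       // extra code and a possible allocation for the temporary
    log_line_view(msg);  // only a pointer and a length are passed
}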
Tune the usage of libraries with compilation flags. When introducing any third-party library, get familiar with its build system and configuration headers, and explore how the available options may affect binary size. Most commonly, disabling unused features can yield big savings in binary size. For example, instead of using <format> from the standard library, consider fmtlib, the formatting library <format> was based on: it has many different flags that can reduce binary size. These flags not only disable features, but can also select simpler algorithms, trading smaller binary size for slower performance. Especially with the standard library, testing out different objects and libraries that accomplish similar functionality can have a big impact. In my experiments, I noticed that using functions from <cstdio> instead of <iostream> for printing to stdout saves over 100 kilobytes of binary size, because iostream brings with it many static strings and the locale library. A similar saving also occurs when using the newer <print> library from C++23.
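As an illustration, the two calls below print the same line, but in my experiments the <cstdio> variant linked into a considerably smaller binary; exact numbers depend on the toolchain and flags:

#include <cstdio>
// #include <iostream>  // pulls in locale machinery and static data

int main() {
    std::printf("temperature: %d C\n", 42);
    // std::cout << "temperature: " << 42 << " C\n";  // same output, bigger binary
}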
Optimizing for binary size should not be an afterthought; it should influence the design as a whole. When designing the architecture of your application, some design decisions can have a great impact on binary size, but we must be mindful of potential downsides in other aspects. For example, using concepts, a C++20 feature, instead of polymorphism can save a lot of binary size by getting rid of virtual functions, indirections, and bigger destructors. However, it also forces the code base mostly into header files, leading to longer compilation times and to recompilation of all the compilation units it affects.
Even within a specific design, it is important to test out variations of how it can be implemented. For example, type erasure is a commonly used pattern that lets generic interfaces operate on concrete types at runtime and can thus reduce binary bloat. It can be implemented through inheritance, static functions, or a hand-rolled vtable. We can implement each variant, analyze its binary size impact in an experiment with an increasing number of base objects and of instances of each object type, and then look for the variant that best fits the size goals of our application.
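As a sketch of the static-function variant, a hand-rolled erased handle can replace virtual dispatch with a plain function pointer; the sensor types below are illustrative:

#include <cstdio>

// Erased handle: stores an object pointer plus a function pointer that
// knows how to operate on the concrete type; no virtual functions involved.
class printable_view {
public:
    template <typename T>
    printable_view(T const& obj)
        : obj_{&obj},
          print_{[](void const* p) { static_cast<T const*>(p)->print(); }} {}

    void print() const { print_(obj_); }

private:
    void const* obj_;
    void (*print_)(void const*);
};

struct sensor_a { void print() const { std::printf("sensor a\n"); } };
struct sensor_b { void print() const { std::printf("sensor b\n"); } };

int main() {
    sensor_a a;
    sensor_b b;
    printable_view views[] = {a, b};
    for (auto const& v : views) {
        v.print();
    }
}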
Beyond all the analysis you do on the source code you write, the target architecture you plan to deploy to can also have a significant impact. For instance, design variations can sometimes yield the same binary size on 64-bit architectures, because their instructions can support larger addresses and are therefore less sensitive to changes in function signatures, such as the number of input variables. Meanwhile, on Arm, Thumb instructions can be either standard or wide, with wide instructions taking up more binary size. The compiler might therefore need to employ a different mixture of standard and wide instructions even with smaller function signatures, yielding different binary size footprints for each design variation.
Binary Size Optimization
In the end, C++ is able to generate very efficient code for size-constrained devices. However, tooling and application context are critical, and there are caveats and design decisions to be aware of:
- Find the optimal setup for your build environment: the flags of your toolchain and linker have a great overall effect on the size of the whole application. Sometimes, even rebuilding the toolchain and/or standard library from source with specialized flags can have a great impact. For example, you can build the standard library with non-default hidden visibility, which can allow the linker to throw away code that is not used in a statically linked firmware (see the sketch after this list).
- Look for the binary size cost of language features and libraries. With Bloaty and Puncover, we can identify symbols and/or files that carry binary bloat and then look for alternatives and optimizations. As we saw in the last section, using <cstdio> instead of <iostream> calls for printing can have a big impact. This is not an isolated incident; in the standard library, and in software design in general, there are many different ways to implement similar functionality. Algorithms and data structures are other notable features with great variance in terms of binary size.
- Test out different variations of the same design and analyze their scalability when increasing the number of objects/instances. For instance, avoid excessive use of virtual functions or function calls that require unnecessary copies or casts of their parameters.
- Beware of interactions around types (e.g., polymorphism), constructors (e.g., the cost of copy and move constructors and when they are used), and function calls (e.g., when copies and/or allocation of temporaries are needed).
- Prefer simpler algorithms (e.g., a for loop over the <algorithm> suite, such as std::find_if) and objects/data structures (e.g., std::map over std::unordered_map), because they usually generate less code.
- With template metaprogramming, a lot of type checking can be deferred to compile time, enabling the user to use simpler, raw types at runtime. This is a common strategy employed by dependency injection frameworks, which can check whether the injected types are valid during compilation. Constexpr/consteval contexts can also defer many calculations to compile time and reduce the amount of code executed at run time, although they can also add a lot of precalculated data to the binary.
- Integrate firmware size as an automated metric in your code analysis/CI pipeline to report the delta introduced by incoming code changes. This reporting allows the development team to keep track of binary size creep from incoming features and act in time.
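As an example of the build environment point above, here is an illustrative set of GCC/Clang flags commonly used to reduce firmware size; the right combination depends on your toolchain and application requirements:

arm-none-eabi-g++ -Os -flto \
    -fno-exceptions -fno-rtti \
    -ffunction-sections -fdata-sections \
    -fvisibility=hidden -fvisibility-inlines-hidden \
    -Wl,--gc-sections \
    main.cpp -o firmware.elf

Here, -ffunction-sections and -fdata-sections, combined with --gc-sections, let the linker discard unreferenced code and data, complementing the hidden-visibility approach mentioned above.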
In Summary
Modern software on microcontrollers represents a challenge not only from the feature development standpoint, but also in terms of being mindful of the impact on binary size and memory footprint. Unlike computing on desktops and servers, embedded platforms have seen a very modest evolution in available memory compared with their evolution in computing power. Because of this modest memory increase, binary size optimization is a crucial criterion across the whole development lifecycle, from architecture to library selection to implementation.
Using firmware analysis tools and knowing the binary size impact of C++ language features are essential to understanding how that impact will scale in your application. In the end, making your application smaller allows the product to host more features and more powerful use cases within the same firmware envelope.