Hello! Today, we’re open-sourcing Perforator, a continuous profiling system we use at Yandex to analyze the performance of most of our services.
The GitHub repository contains the source code and infrastructure for deploying Perforator on a Kubernetes cluster. You can also run Perforator locally as a simpler alternative to perf record, providing more accurate profiles with lower overhead. The source code is licensed under MIT (with GPL for eBPF programs) and runs on x86-64 Linux.
Using Perforator and our previous profiling approaches, we regularly achieve tens of percent performance improvements in Yandex’s largest services, such as the Ad Delivery System and Yandex Search. Moreover, Perforator fills a critical gap in open-source profiling, enabling simplified automatic program optimization through profile-guided optimization (PGO). Our tests demonstrate that PGO can speed up workloads by approximately 10% in multiple scenarios.
In this article, we will:
- Explore Linux profiling techniques.
- Discuss the challenges of effective profiling.
- Break down how Perforator works under the hood.
- Show how to get the most out of Perforator.
Why Profile Code
Developers often have to figure out why their applications are slow or use too many resources. There can be many reasons for this. For example, there might be a business need for hardware optimization, or the application might require a performance boost to ensure faster response times.
In practice, developers often struggle to pinpoint where the bottlenecks are. Modern computers are too complex for simple mental models to provide a complete picture. That’s where profilers come in—specialized tools that give developers clear insight into what’s happening inside their programs. Using profilers, developers can identify and resolve performance issues, often achieving dramatic speed improvements: websites load faster, applications become more responsive, and companies save a significant amount of money on server infrastructure.
Profilers are so invaluable that seasoned developers strongly advise against optimizing code without first profiling its execution. Quite often, just running a program through a profiler for the first time uncovers a bunch of easy wins—simple optimizations that can boost performance by tens of percent.
Current Profiling Approaches
The software development community has a wealth of profiling tools. We’ve been optimizing code for almost as long as we’ve been writing it. These tools can reveal nearly everything about program execution, from high-level overviews down to instruction-level traces such as those produced by Intel PT.
Profilers generally fall into two main categories:
- Instrumentation profilers modify the program in specific ways to facilitate the collection of execution statistics. While powerful, this approach isn’t ideal for profiling many different programs scattered across a large cluster. Instrumentation often introduces significant overhead, and worse, it requires changes to the program’s build process.
- Sampling profilers periodically pause program execution and sample its state. Repeating this process enough times provides a representative profile of how the program runs.
There are also hybrid approaches. Tracy is a great example: it supports both instrumentation and sampling. However, since instrumentation still has drawbacks, we’ll focus primarily on sampling.
The Poor Man’s Profiler
One of the simplest yet most effective profiling techniques is the Poor Man’s Profiler (PMP). The core idea is to periodically attach to the target program using gdb, capture the stack traces of all threads, and then resume execution. Repeating this process multiple times yields a weighted collection of stack traces.
The more time a program spends in a specific section of the code, the more stack traces from that section you’ll capture. The more stack traces you gather, the more accurate your profile becomes. This approach delivers solid results with minimal effort. For instance, before implementing Perforator, we extensively used PMP in the Ad Delivery System, one of Yandex’s largest services. PMP is a simple tool—just a few hundred lines of Python code—but it saved the company a tremendous amount of resources.
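If you want to see the idea in action, the core of PMP is little more than running gdb -p <pid> -batch -ex "set pagination 0" -ex "thread apply all bt" in a loop and counting how often identical stack traces show up; everything beyond that is aggregation and pretty-printing.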
However, PMP has its drawbacks. It’s quite invasive and works best on large clusters; otherwise, gathering enough samples for a representative profile can be challenging. Moreover, PMP primarily profiles only wall time (real-world time). For example, if your program spends 50% of its time sleeping and 50% on the CPU, the profile will reflect both equally. If you need to analyze CPU time or other metrics, such as instruction counts or cache misses, you’ll have to use alternative methods.
Linux perf
The de facto standard for profiling on Linux is Linux perf. It’s a versatile toolkit with a wide range of use cases, though navigating its many options can be challenging. One of the most popular modes is perf record, a sampling profiler. perf record uses the perf_event subsystem to subscribe to various system events and collect relevant profiling information, such as stack traces, process names, and thread names, every few events.
For example, the command perf record -e instructions -c 1000000 -p 1234 will capture a snapshot of the thread’s state every million instructions executed in the process with PID 1234. Perf allows profiling of much more than just CPU time. You can also build profiles based on metrics like the number of page faults or cache accesses at specific levels. The only limitation is that the system (for software events like page faults or context switches) or the CPU (for hardware events) has to actually support the specific event. Interestingly, creating a PMP-style wall-time profile with perf isn’t straightforward. You’ll need to merge on-CPU and off-CPU profiles to achieve the same result.
FlameGraphs
To visualize profiles, developers often use FlameGraphs, a powerful and visually intuitive format popularized by profiling guru Brendan Gregg. FlameGraphs offer an elegant, interactive way to pinpoint bottlenecks and easily explore the inner workings of programs.
Continuous Profiling
At one point, Google published a paper on Google Wide Profiler (GWP), a continuous profiler that operates on all servers across multiple data centers. This system offers insights into how programs function at the scale of an entire cluster. Although GWP itself was never open-sourced, the publication and subsequent discussions fostered the growth of an entire profiling field. Under the hood, distributed profilers usually depend on classic single-machine tools such as perf.
These systems gather extensive information about service execution and help answer questions like, “How much money would optimizing this function save?” or even automatically optimize programs. In recent years, Google and Meta have open-sourced numerous tools and methods for automatic post-link optimization based on profiles from sampling profilers, such as BOLT and AutoFDO. These approaches can lead to performance improvements of 10–20%, even over LTO builds.
While these technologies have demonstrated effectiveness in large-scale deployments over the years, their seamless integration into CI/CD pipelines hinges on one missing piece: a distributed profiler.
Thus, we developed Perforator, a distributed continuous profiler for data centers.
Why Build Yet Another Profiler?
The task initially appears straightforward: take a proven single-machine profiler, run it across the entire cluster, write some straightforward code to gather profiles from all servers, and then somehow combine them. If it were that simple, this article wouldn’t exist. However, as always, the devil is in the details.
Stack Unwinding
So, how does perf work? When you run perf record -e instructions -c 1000000 -p 1234, perf uses the Linux kernel to configure the CPU to count the number of executed instructions via the PMU (Performance Monitoring Unit).
The PMU is a set of dedicated processor registers that increment with each instruction. When the PMU event counter overflows, the processor triggers a special interrupt. This interrupt is handled by the Linux kernel, which captures a snapshot of the thread’s state at the moment the millionth instruction is executed.
This architecture enables highly precise thread state analysis but requires a significant portion of perf to run within the Linux kernel (!) since the interrupt must be handled in kernel space. For instance, Linux can unwind the thread’s stack using its knowledge of the stack frame organization for the specific architecture.
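To make the mechanism concrete, here is a minimal sketch of subscribing to the same event through perf_event_open, the syscall perf itself relies on. It is not how perf is structured internally, just the same kernel API; the struct fields and event names are standard Linux definitions, and the PID is the example from above.

```cpp
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

int main() {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // count retired instructions
    attr.sample_period = 1000000;              // overflow (and sample) every 10^6 instructions
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_CALLCHAIN;
    attr.exclude_kernel = 1;                   // sample user space only

    pid_t pid = 1234;  // the process we want to profile
    // pid > 0, cpu == -1: follow this process on every CPU it runs on.
    int fd = syscall(SYS_perf_event_open, &attr, pid, /*cpu=*/-1, /*group_fd=*/-1, /*flags=*/0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    // perf then mmaps a ring buffer on top of fd and reads samples from it.
    close(fd);
    return 0;
}
```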
If you want perf record to collect stack traces, add the --call-graph flag. However, experimenting with this flag often reveals that the generated profiles can be challenging to interpret. A typical broken profile might look something like this.
This problem arises because modern compilers don’t generate frame pointers by default. While this saves a few instructions per function and frees up a register, it also makes profiling much more difficult. Brendan Gregg provides an excellent breakdown of the problem. A popular solution is to reintroduce frame pointers into the build process. On average, the performance overhead is small, around 1–2%. This approach is commonly used by large companies and Linux distributions.
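To see why frame pointers make unwinding so cheap, here is an illustrative user-space sketch of the walk—not the kernel’s actual code. With frame pointers, each stack frame begins with the saved parent rbp followed by the return address, so the unwinder just follows a linked list.

```cpp
#include <cstdint>
#include <vector>

// With -fno-omit-frame-pointer on x86-64, every frame starts with:
//   [rbp]     -> saved rbp of the caller
//   [rbp + 8] -> return address in the caller
struct Frame {
    uint64_t saved_rbp;
    uint64_t return_address;
};

// readFrame stands in for "safely read 16 bytes of the target thread's stack";
// in the kernel this is a checked copy from the traced process's memory.
std::vector<uint64_t> walkFramePointers(uint64_t rip, uint64_t rbp,
                                        bool (*readFrame)(uint64_t addr, Frame* out)) {
    std::vector<uint64_t> stack{rip};
    for (int depth = 0; depth < 128 && rbp != 0; ++depth) {
        Frame frame{};
        if (!readFrame(rbp, &frame) || frame.return_address == 0) break;
        stack.push_back(frame.return_address);
        if (frame.saved_rbp <= rbp) break;  // the chain must move toward the stack base
        rbp = frame.saved_rbp;
    }
    return stack;
}
```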
However, recompiling all programs and libraries with -fno-omit-frame-pointer is a complex task. Even if the main binary is compiled this way, system libraries often remain compiled with -fomit-frame-pointer. As a result, stack traces passing through, for example, glibc end up corrupted. Furthermore, the exact performance loss varies greatly depending on the workload; in some cases, the overhead can be much higher, even reaching double-digit percentages.
An alternative solution for stack unwinding comes from DWARF, the same machinery that supports debuggers and exception handling. Compilers generate a .eh_frame section that encodes how to reconstruct the parent stack frame from any instruction in the program, even in programs without exceptions or with exceptions disabled (for example, in C). You can disable .eh_frame generation with the -fno-asynchronous-unwind-tables flag, but in practice, this offers only a slight reduction in executable size while making debugging and profiling much harder.
Perf can utilize DWARF for stack unwinding; you just need to specify the --call-graph=dwarf flag. However, there’s a crucial detail: DWARF is Turing-complete. Stack unwinding occurs during an interrupt in the Linux kernel, making it virtually impossible to support DWARF in that context. Not only would this necessitate reading DWARF data from disk during an interrupt, but the unwinding logic itself becomes excessively complex, introducing numerous potential bugs.
Linus Torvalds famously conveyed his strong opposition to incorporating DWARF-based unwinding into the kernel:
I never ever want to see this code ever again.
…
Dwarf unwinder had bugs itself, or our dwarf information had bugs, and in either case it actually turned several “trivial” bugs into a total undebuggable hell.
…
dwarf is a complex mess …
An unwinder that is several hundred lines long is simply not even remotely interesting to me.
…
just follow the damn chain on the stack without the “smarts” of an inevitably buggy piece of crap.
Therefore, perf employs a different strategy. Instead of unwinding the stack in the kernel, it copies only the top portion of the thread’s stack to user space. This enables stack unwinding to occur in user space, where complex code can run safely.
Of course, copying the entire stack (typically several megabytes per sample) would be too costly. To mitigate this, perf copies only part of the stack, which inevitably loses data for programs that use a lot of stack. The maximum stack captured by perf is 65,528 bytes, while the default thread stack size on Linux is 8 MB.
Despite these limitations, this approach generally works well and yields decent profiles. For instance, the profile for the same ClickHouse instance appears as follows:
ZSTD appears to use the stack heavily, and perf’s 65,528-byte limit wasn’t enough.
Moreover, even setting aside stack size limitations, profiles generated with --call-graph=dwarf are an order of magnitude larger than those generated with --call-graph=fp. Instead of saving a compact list of return addresses, perf has to store the entire captured stack. As a result, large-scale use of --call-graph=dwarf requires a lot of resources. A profile collected over just a few dozen seconds can take up gigabytes of space. If we attempt to reduce the stack size limit, the profile quality begins to degrade. Let’s test this scenario:
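(The limit is controlled by the second argument to --call-graph: for example, perf record --call-graph dwarf,2048 copies at most 2 KiB of stack per sample.)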
How Perforator Works
Perf just doesn’t cut it for us. We need to unwind the stack right inside the Linux kernel. So, what are our choices?
- Patch the kernel. This approach is highly complex. Rolling out kernel patches is difficult, and maintaining them is even more challenging. Plus, stability and security remain significant concerns.
- Kernel modules. This option is more flexible for our scenario but still raises stability concerns. Interestingly, the first version of Google Wide Profiler worked this way: it relied on OProfile, which implemented a kernel module for profiling before Linux perf existed. However, that method turned out to be pretty fragile and unreliable.
- eBPF. Fortunately, a powerful technology has emerged: eBPF (extended Berkeley Packet Filter). It enables small, non-Turing-complete, verifiable programs to execute within the Linux kernel. Currently, eBPF is extensively used for profiling and presents a promising solution to satisfy our needs.
Stack Unwinding with eBPF
Linux allows eBPF programs to run when the PMU triggers an interrupt and provides helper functions for reading memory in both user and kernel space. This gives us the foundational elements to create a perf-like profiler.
The challenge is unwinding the stack using DWARF. As mentioned before, DWARF is complex and, theoretically, Turing-complete (sic!). However, by the late 2010s, the consensus began to shift: most of this complexity is unnecessary, and the same functionality can be expressed through much simpler mechanisms (see techniques 1, 2, and 3).
The paper “Reliable and Fast DWARF-Based Stack Unwinding” marked a significant breakthrough. The authors noted that despite DWARF’s overall complexity, compilers typically generate relatively straightforward unwinding rules. This crucial insight opened the door for practical and efficient unwinding implementations.
DWARF CFI
Strictly speaking, DWARF is a family of formats. The one relevant to stack unwinding is DWARF CFI: Call Frame Information. CFI encodes a set of unwinding rules for each instruction, specifying how to determine the position of the parent stack frame—that is, the register values in the parent function. These rules can be arbitrarily complex because they are defined in terms of programs for the DWARF virtual machine.
However, the machine code that modern compilers typically generate has remarkably simple unwinding rules. On x86-64, for example, the stack pointer at the time of the current function’s call is almost always computed as a simple offset from the rsp or rbp register. Similar heuristics exist for other architectures.
This means that DWARF CFI can be effectively reduced to a simplified structure, eliminating unnecessary complexity and making it easy to process with eBPF. From there, it’s just a matter of implementation. The resulting system works quite well. The profiler retrieves raw stack traces from eBPF as a series of addresses in the binary file, which can then be further analyzed.
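As a sketch of the idea—not Perforator’s actual data structures—one row per code range is essentially enough: how to compute the canonical frame address (CFA) from rsp or rbp, and where the return address and saved rbp live relative to it.

```cpp
#include <cstdint>

// One row of a flattened unwind table, valid for [start_pc, start_pc + pc_len).
struct UnwindRow {
    uint64_t start_pc;
    uint32_t pc_len;
    enum class CfaBase : uint8_t { Rsp, Rbp } cfa_base;  // CFA = base register + cfa_offset
    int32_t cfa_offset;
    int32_t ra_offset;   // return address lives at CFA + ra_offset (usually -8)
    int32_t rbp_offset;  // where the caller's rbp was saved, relative to the CFA
};

struct Registers { uint64_t rip, rsp, rbp; };

// One unwinding step: given the row matching regs.rip, recover the caller's
// rip/rsp/rbp. readU64 stands in for a bounds-checked read of the target
// thread's memory (bpf_probe_read_user in the eBPF version).
bool unwindOneFrame(const UnwindRow& row, Registers& regs,
                    bool (*readU64)(uint64_t addr, uint64_t* out)) {
    uint64_t cfa = (row.cfa_base == UnwindRow::CfaBase::Rsp ? regs.rsp : regs.rbp)
                   + row.cfa_offset;
    uint64_t ret_addr = 0;
    if (!readU64(cfa + row.ra_offset, &ret_addr) || ret_addr == 0) return false;
    uint64_t saved_rbp = regs.rbp;
    readU64(cfa + row.rbp_offset, &saved_rbp);  // may be missing in leaf frames
    regs.rip = ret_addr;
    regs.rsp = cfa;  // by definition, the CFA is the caller's rsp at the call site
    regs.rbp = saved_rbp;
    return true;
}
```

Repeating this step until the rows run out yields the full stack trace as a list of return addresses.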
Symbolization
Next, we must convert the collected addresses into function names and source code locations. This process is called symbolization, and it is surprisingly complex. On Linux, this information is typically stored in DWARF, which is notoriously difficult to parse. Parsing DWARF consumes significant memory and CPU resources, so we defer symbolization until the moment a user actually requests the profile.
To make things even faster, we almost always use GSYM, a much more compact format designed for fast, resource-efficient symbolization. Information about GSYM is limited online, but it appears to be supported by developers at Meta, likely for similar internal profiling tasks.
Perforator converts the DWARF data for all binaries into GSYM and then operates exclusively with GSYM. This typically makes symbolization much faster than using DWARF directly, although we can still utilize DWARF as needed.
Overhead
The resulting setup is highly efficient. Currently, we’re continuously profiling CPU cycles across a significant portion of our fleet using Perforator, collecting 100 samples per second per core. The measured slowdown for user processes is minimal, about 0.1%. This low overhead enables us to implement Perforator across the board.
Furthermore, after months of using the system, we’ve never had to exclude any services from profiling, which gives us confidence in this approach’s reliability and stability.
The Perforator agent consumes some of the resources on the machine it runs on. The exact numbers depend heavily on the machine’s size, workload, types of binaries, and sampling frequency. On large hosts, it usually uses about 1% of the host’s CPU and a few gigabytes of memory.
Architecture
Agent
We run the Perforator host agent on every machine in our profiling cluster. It analyzes all processes on the host, profiles the relevant ones (like containers), and sends profiles to the storage backend every minute. The agent also uploads the corresponding binary files along with the profiles, enabling offline symbolization: if the same binary is executed across thousands of servers, it’s far more efficient to symbolize it once on the backend rather than repeatedly on every machine.
Storage
Collected profiles need to be stored somewhere scalable. To this end, we’ve built a storage system using well-known, scalable components. Binary files and minute-level profiles are stored in S3, while metadata about the profiles is kept in ClickHouse. This setup allows us to store petabytes of profiles and quickly locate specific ones using complex selectors in just a few hundred milliseconds.
Backend
On top of the storage layer, we have a backend providing a convenient gRPC interface for users and agents. The backend is split into several microservices, so we can scale specific parts independently as needed.
Technical Details on Code Organization
We’re open-sourcing a tool that’s actively used and developed inside our monorepo. For now, we rely on Yamake as our build system. The code has a dual license: almost everything is under MIT, except for the eBPF programs, which are under GPL.
Perforator is built using a mix of languages; our flexible build system makes this somewhat unusual combination possible. Moving forward, we plan to simplify the build process by switching to CMake wherever we can.
Perforator Features
Native Languages
Perforator offers excellent support for native languages, and that’s where we’ve put most of our optimization effort. We’ve rigorously tested it with C++, C, Go, and Rust. It’s likely to work well with other native languages too, as long as they aren’t too exotic.
Stack Unwinding Challenges
Even though Perforator is pretty robust, it sometimes struggles to unwind stacks. There are two main reasons:
- Handwritten assembly code. Stack unwinding can fail when it encounters handwritten assembly. While compilers allow developers to annotate assembly code so that DWARF CFI is generated, this feature is rarely used, and even when it is, the CFI rules can become complicated.
- Disabled .eh_frame generation. Some libraries and executables disable .eh_frame generation using the -fno-asynchronous-unwind-tables flag. This makes stack unwinding almost impossible for most tools.
We’re actively working on ways to fix these issues and get as close to perfect stack unwinding as possible.
Interpreted and JIT-Compiled Languages
However, Perforator isn’t confined to native languages. It can also analyze interpreted or JIT-compiled languages across your entire cluster. There are significant advantages to having a unified system: many features can be designed to work generally without being tied to a specific language. For example, Perforator can map its profiles directly back to the source code.
Perforator currently supports several languages, but adding new ones can sometimes require less-than-ideal workarounds, such as tracking internal structure offsets for each Python runtime version. We’ve drawn inspiration from examples like the eBPF subsystem tests in Linux, which collect Python stack traces using eBPF, and tools like py-spy, which produce similar results.
At this time, Perforator can unwind stacks for recent versions of Python (beginning with 3.12). We’re working to extend support for more languages and runtimes, with Java profiling that doesn’t require modifying how the JVM is launched as our top priority.
Perforator also supports the de facto standard mechanism for JIT-compiled languages: perf-pid.map files. This format is used by many modern runtimes, though enabling it often requires extra flags when launching the VM. This approach works with Python (3.12 and later), Java, and Node.js, among others.
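The mechanism itself is simple: the runtime writes a plain-text file (typically /tmp/perf-<pid>.map) in which each line gives the start address, length, and name of one JIT-compiled region in hexadecimal, along the lines of 7f3a4c001000 1a0 my_jitted_function. Perforator uses these files to symbolize addresses that don’t belong to any mapped binary.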
FDO: Feedback-Driven Optimization
One of Perforator’s key features is its ability to generate profiles for FDO (Feedback-Driven Optimization). FDO is a relatively new optimization mechanism that utilizes runtime profiles from previous versions of your program to enhance performance.
The traditional alternative, PGO (Profile-Guided Optimization), is more cumbersome and, consequently, less widely adopted. With PGO, you must build the binary twice. After the first build, the binary needs to be executed under real-world workloads to collect a profile, which is then used in a subsequent build step.
Perforator greatly simplifies this entire process. You simply collect a profile using Perforator’s API before building, and then provide it to the compiler with the -fprofile-sample-use flag.
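With Clang, for example, the final build step looks roughly like clang++ -O2 -fprofile-sample-use=cluster.prof … — the file name here is just a placeholder for whatever profile you fetched from Perforator.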
Our benchmarks demonstrate performance gains of up to 10% on our most optimized, high-traffic programs. The documentation provides more details.
Fast FlameGraphs
We initially visualized our profiles using the SVG format from the original flamegraph.pl. However, as Perforator evolved and gathered more profiling data than a single server could handle, we started encountering limitations with SVG. While these detailed profiles were excellent for in-depth analysis, they were also quite cumbersome and slow to navigate.
Here’s why flamegraph.pl wasn’t suitable for Perforator:
- Performance. Rendering even small profiles with flamegraph.pl could take several seconds, while standard Perforator profiles might take minutes to process.
- Data loss. flamegraph.pl aggressively filters out functions that appear less significant, which can compromise the profile’s accuracy. Navigating through different sections of the profile could unintentionally remove crucial information.
- Scalability. Lowering the threshold for excluding functions leads to immense profiles (sometimes reaching gigabytes!) that become non-interactive. Rendering and interacting with such profiles, even on modern hardware like Apple Silicon, can take tens of seconds. The challenge is that SVG represents each function in the profile with multiple DOM nodes, and with hundreds of thousands of functions, rendering grinds to a halt, even on powerful machines.
Given the wealth of valuable information produced by the profiler, we aimed to render large profiles effectively. After reviewing existing technologies, we adopted the flame graph format from async-profiler as our foundation and incorporated our own optimizations. The outcome? We can now render a Flame Graph with a million functions in under 100 milliseconds on modern hardware.
This optimized format completely transforms how we utilize FlameGraphs. Exploring and clicking through functions feels incredibly seamless and intuitive, even in extensive programs. For large applications, our FlameGraphs maintain their detailed structure even when you zoom in on rarely executed yet interesting functions.
These comprehensive FlameGraphs are more than just tools for performance tuning. They’re a powerful resource for understanding and exploring codebases. Believe it or not, analyzing a profile can often be a far more effective way to become familiar with a large program than sifting through hundreds of thousands of lines of code. Detailed FlameGraphs significantly simplify and enhance this learning journey.
Local Profiler
Perforator can run in local mode, as a substitute for perf record. This allows you to profile any process or even your entire system on the fly without modifying the code you’re analyzing. Additionally, Perforator can symbolize binaries that lack debug information by utilizing debuginfod. To enable this, simply set the DEBUGINFOD_URLS environment variable to the appropriate URL for your distribution; on Ubuntu, for instance, use https://debuginfod.ubuntu.com/. Refer to the documentation for more details.
Wall-Time Profiling
Perforator is not limited to profiling CPU cycles; it can manage a broad range of events, similar to perf. However, as previously mentioned, wall-time profiling can be quite challenging. This type of profiling is crucial for optimizing a program’s response time for users. If a program performs tasks beyond CPU-bound computations—such as reading from disk or waiting for network packets—CPU-cycle-based profiles can easily overlook such activities.
To avoid this issue, the Perforator agent combines the time a thread uses the CPU with the time it waits for I/O. This method isn’t flawless; for instance, if a thread sleeps for several hours, it won’t be captured. However, it performs effectively for most applications.
The resulting profile provides a comprehensive overview of each thread’s activity, illustrating both CPU usage and I/O wait times in their correct proportions. This facilitates the identification and optimization of bottlenecks throughout your entire program.
Slice-Based Profiles: A/B Testing
One of Perforator’s more innovative features is its ability to annotate collected stacks with custom tags from C++ code. Why is this useful? When we modify the behavior of our programs, we often rely on A/B testing. This involves activating a new feature for a small group of users and then monitoring its impact on both them and the program’s performance.
For instance, we might conduct A/B testing to implement a brand-new, resource-intensive machine learning model or a complex algorithm. The new feature could consume significantly more resources; however, A/B tests alone can struggle to pinpoint precisely where performance slowdowns occur.
C++ programs can annotate stacks with thread-local tags—these are strings or numbers that our eBPF program reads and incorporates into the profile. This enables us to generate separate profiles for specific user groups or A/B test cohorts. For example, by writing a unique ID for each request into the tags, we can sift through our system logs to filter the profile and focus exclusively on the requests we are interested in. This approach allows us to obtain accurate profiles for specific A/B tests here at Yandex.
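As a sketch of the mechanism—the names below are illustrative, not Perforator’s real API; see the documentation for the actual interface—the application keeps the current request’s metadata in thread-local storage, and the profiler reads that memory when it takes a sample:

```cpp
#include <cstdint>
#include <string_view>

// Illustration only: the real Perforator API is documented separately.
// The application keeps the current request's metadata in thread-local storage;
// the profiler's eBPF program reads this memory whenever it takes a sample.
struct ProfilerTags {
    char experiment_id[64];  // e.g., the A/B test cohort for this request
    uint64_t request_id;     // lets us match samples against request logs
};

thread_local ProfilerTags g_profiler_tags{};

void setRequestTags(std::string_view experiment, uint64_t request_id) {
    auto n = experiment.copy(g_profiler_tags.experiment_id,
                             sizeof(g_profiler_tags.experiment_id) - 1);
    g_profiler_tags.experiment_id[n] = '\0';
    g_profiler_tags.request_id = request_id;
}

// At the start of handling each request:
//   setRequestTags("ranking_model_v2", request_id);
// Every sample taken while the thread handles this request is then attributed
// to that experiment and request in the resulting profile.
```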
However, this feature significantly increases the profile size since each sample gets a unique key. Previously, we could aggregate multiple samples for a single stack, but with unique tags, such aggregation is no longer feasible.
We used to store profiles in the standard pprof format, but we encountered challenges with size and processing speed. Now, we’ve implemented extensive deduplication for all entities in the profile. Consequently, the introduction of tags only slightly increases the profile size.
Minicores: Lightweight Stack Collection on Fatal Signals
An unexpected yet useful feature of the system is its ability to collect stack traces during fatal signals. Thanks to eBPF’s flexibility, Perforator can hook into the kernel function responsible for delivering signals to threads and efficiently retrieve stack traces whenever a fatal signal (like SIGKILL) is received.
This provides a lightweight way to obtain information about crashes without dealing with large core dumps. In large clusters, saving core dumps for every crash can significantly slow things down, so this is often limited. However, with Perforator, we can obtain stack traces for every crash with minimal overhead.
Even better, by combining this crash stack collection with the ability to read thread-local variables, Perforator can determine precisely which request was being processed at the time of the crash. This is incredibly valuable for runtime services, where understanding the context of a failure is essential for debugging.
System Limitations
Several nuances may potentially restrict the use of Perforator:
- Perforator requires CAP_SYS_ADMIN privileges. Through eBPF, it can access arbitrary memory, including kernel space, and it also needs to read arbitrary binary files in the system to construct unwind tables.
- Perforator requires a fairly recent Linux kernel—version 5.4 or later. As part of our testing, we ensure the agent works correctly on all long-term support (LTS) kernels from 5.4 onwards. Why this limit? eBPF has evolved rapidly, and earlier versions lack the features required for writing complex programs. While it’s theoretically possible to support versions as early as 4.19, doing so would be a complex and labor-intensive process.
- Currently, Perforator runs only on x86-64 Linux. While ARM support is being developed, it’s not yet ready for release. We hope to include it in the future.
- Although DWARF unwinding works reliably in most cases, it can still fail occasionally. Across our fleet, we’ve observed rare issues unrelated to missing .eh_frame data or handwritten assembly. In such cases, enabling -fno-omit-frame-pointer for specific binaries can help resolve the problem.
- The Perforator agent may require a significant amount of anonymous memory to store unwind tables—several gigabytes for large hosts. We have some good ideas for improving this, and we aim to reduce memory usage to under a gigabyte in most cases.
How to Try Perforator
You can try Perforator in two ways:
- Simple Option: Local Run. Download or build the Perforator binary, then run sudo perforator record -a --duration=60s. After one minute, Perforator will automatically open the collected system profile in your browser.
- Advanced Option: Kubernetes Cluster. Deploying Perforator on a Kubernetes cluster is more involved, but we’ve made the process as straightforward as possible. With just a few commands, you can get a fully working setup. The documentation provides a step-by-step guide.
We believe that fundamental system technologies such as operating systems, compilers, system libraries, debuggers, and profilers should be open-source and developed in collaboration with the community. Our work can bring value to developers and businesses alike, and by open-sourcing this project, we can create better profiling tools and shape the future of this field together with you.
For example, we recognize the need to develop new shared formats: a unified format for storing stack unwinding rules as an alternative to DWARF CFI, a format for debug information for symbolization (GSYM), and a standard profile storage format. We plan to contribute our experience to developing a universal profile format within OpenTelemetry.
Perforator is available as an open-source project on GitHub. Documentation and installation instructions can be found at perforator.tech.
We are actively developing Perforator, so there may be some rough edges initially—but we’ll address them! We’d love for you to try Perforator and share your feedback. Feel free to open GitHub Issues or, even better, submit Pull Requests.