2024 will be an interesting year for software performance, based on what I read a few days ago: “Data-type profiling for perf”:
- “Tooling for profiling the effects of memory usage and layout has always lagged behind that for profiling processor activity, so Namhyung Kim’s patch set for data-type profiling in perf is a welcome addition. It provides aggregated breakdowns of memory accesses by data type that can inform structure layout and access pattern changes. Existing tools have either, like heaptrack, focused on profiling allocations, or, like perf mem, on accounting memory accesses only at the address level. This new work builds on the latter, using DWARF debugging information to correlate memory operations with their source-level types.”
There’s also the author’s presentation from November 2023:
- “Memory accesses can suffer from problems like poor spatial and temporal locality, as well as false sharing of cache lines. Existing presentations of profile data, such as data from the perspective of code, can make it difficult to reason as to what the problems are and to work out what the fixes should be. A typical fix may be to reorder variables within a data structure. In this work Namhyung Kim will present ongoing work combining perf event and DWARF debug information, in order to correlate samples and present data type of the variables accessed within a program. However, DWARF debug information is not reliable in enabling a good understanding of variables accessed. The presentation will discuss the state of data type profiling and its addition to the Linux perf tool, how toolchain limitations are worked around by the tool, and how toolchains can be improved for data type profiling in the future.”
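The abstract mentions that a typical fix is to reorder variables within a data structure. Here is a minimal sketch of why that can help, using Python’s ctypes module to mirror the platform’s C layout rules. The Sample record, its field names, and the helper functions are hypothetical illustrations, not anything taken from the perf patch set or the talk:

```python
import ctypes


class Sample(ctypes.Structure):
    # Hypothetical record: 1-byte fields interleaved with 8-byte fields,
    # which forces the compiler to insert alignment padding.
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("timestamp", ctypes.c_uint64),
        ("kind", ctypes.c_uint8),
        ("address", ctypes.c_uint64),
    ]


class SampleReordered(ctypes.Structure):
    # Same fields, largest first, so the small fields share one slot.
    _fields_ = [
        ("timestamp", ctypes.c_uint64),
        ("address", ctypes.c_uint64),
        ("flag", ctypes.c_uint8),
        ("kind", ctypes.c_uint8),
    ]


def layout(struct_type):
    """Return (name, offset, size) for each field, in declaration order."""
    return [
        (name, getattr(struct_type, name).offset, getattr(struct_type, name).size)
        for name, _ in struct_type._fields_
    ]


def padding_bytes(struct_type):
    """Bytes in the struct that hold no field data."""
    used = sum(size for _, _, size in layout(struct_type))
    return ctypes.sizeof(struct_type) - used


if __name__ == "__main__":
    for struct_type in (Sample, SampleReordered):
        print(struct_type.__name__, ctypes.sizeof(struct_type),
              "bytes,", padding_bytes(struct_type), "of them padding")
        for name, offset, size in layout(struct_type):
            print(f"  {name:<10} offset={offset:<3} size={size}")
```

On a typical 64-bit Linux system the reordered struct is smaller because less padding is needed, so more records fit in each cache line. Exact sizes depend on the platform ABI, which is why the sketch prints them rather than assuming them.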
A few more relevant tools for people doing performance work:
- FTrace: Ftrace (Function Tracer) is an internal tracer designed to help out developers and designers of systems to find what is going on inside the kernel. It can be used for debugging or analyzing latencies and performance issues that take place outside of user-space. Although ftrace is typically considered the function tracer, it is really a framework of several assorted tracing utilities. There’s latency tracing to examine what occurs between interrupts disabled and enabled, as well as for preemption and from a time a task is woken to the task is actually scheduled in. One of the most common uses of ftrace is the event tracing. Throughout the kernel is hundreds of static event points that can be enabled via the tracefs file system to see what is going on in certain parts of the kernel.
- Coz: Finding Code that Counts with Causal Profiling. Coz is a profiler for native code (C/C++/Rust) that unlocks optimization opportunities missed by traditional profilers. Coz employs a novel technique called causal profiling that measures optimization potential. It predicts the impact that optimizing a piece of code will have on overall throughput or latency. Profiles generated by Coz show the “bang for buck” of optimizing a line of code in an application. In the example profile shown in the Coz documentation, almost every effort to optimize the performance of the profiled line leads directly to an increase in overall performance, making it an excellent candidate for optimization efforts.
- eBPF: short for extended Berkeley Packet Filter, a technology that can run sandboxed programs in a privileged context such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without having to change kernel source code or load kernel modules.
- perf c2c: Use for “Shared Data Cache-to-Cache (C2C)” analysis. See also “Chapter 26. Detecting false sharing” in Red Hat documentation:
- False sharing occurs when a processor core on a Symmetric Multi Processing (SMP) system modifies data items on the same cache line that is in use by other processors to access other data items that are not being shared between the processors.
- This initial modification requires that the other processors using the cache line invalidate their copy and request an updated one despite the processors not needing, or even necessarily having access to, an updated version of the modified data item.
- You can use the perf c2c command to detect false sharing.
- Perfetto: System profiling, app tracing and trace analysis
- Linux kernel tracing: Capture high frequency ftrace data: scheduling activity, task switching latency, CPU frequency and much more.
- Userspace profilers and extra probes: Native heap profiling, Java heap profiling, pollers for /proc stat files.
- Intel® VTune Profiler: optimizes application performance, system performance, and system configuration for HPC, cloud, IoT, media, storage, and more. Multilingual: C, C++, C#, Fortran, Python, Go, Java, .NET, Assembly, or any combination of languages.
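To make the false-sharing description in the perf c2c entry more concrete, here is a small sketch that checks whether two fields of a struct land on the same cache line. It assumes 64-byte cache lines and a struct that starts on a cache-line boundary. The Counters struct, its fields, and the helper names are hypothetical, and this is a static layout check, not what perf c2c itself does:

```python
import ctypes

# Assumption: 64-byte cache lines, which is common but not universal.
CACHE_LINE = 64


class Counters(ctypes.Structure):
    # Hypothetical layout: each counter is updated by a different thread.
    _fields_ = [
        ("hits", ctypes.c_uint64),    # written by thread A
        ("misses", ctypes.c_uint64),  # written by thread B
    ]


class PaddedCounters(ctypes.Structure):
    # Explicit padding pushes the second counter onto the next cache line.
    _fields_ = [
        ("hits", ctypes.c_uint64),
        ("_pad", ctypes.c_uint8 * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("misses", ctypes.c_uint64),
    ]


def cache_line_of(struct_type, field):
    """Cache-line index of a field, assuming the struct is line-aligned."""
    return getattr(struct_type, field).offset // CACHE_LINE


def may_false_share(struct_type, field_a, field_b):
    """True if both fields fall on the same cache line."""
    return cache_line_of(struct_type, field_a) == cache_line_of(struct_type, field_b)


if __name__ == "__main__":
    for struct_type in (Counters, PaddedCounters):
        print(struct_type.__name__,
              "hits/misses share a line:",
              may_false_share(struct_type, "hits", "misses"))
```

Padding trades memory for isolation, so it is usually worth applying only to fields that a tool such as perf c2c has shown to be contended.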
See also:
- Upcoming FOSDEM talk “Profiling Python with eBPF: A New Frontier in Performance Analysis”
- “Understanding Request Latency with Profiling”: a detailed technical article about Datadog’s Java wallclock profiler, exploring how to improve request latency without making any code changes, or even seeing the code.
- Book: “Learning eBPF: Programming the Linux Kernel for Enhanced Observability, Networking, and Security”
- Book: “Systems Performance: Enterprise and the Cloud”
- Book: “Understanding Software Dynamics”











