Chronicle of Optimization with perf and C
Rival Fortress Update #36
Note: You can find the source code for the optimized version of poly2tri on Github.
I needed a fast and robust mesh triangulation solution for Metric Panda Engine and after some research I found the excellent Poly2Tri library, an implementation in C++ of the Sweep-line algorithm for constrained Delaunay triangulation.
Metric Panda Engine is written in C-style C++, so I wanted to port the code for poly2tri and customize it to fit the engine’s needs.
While I ported the code I noticed opportunities for optimization, and in this post I’ll talk about what I did.
Preparing the library for profiling
Before I went ahead and made changes I wanted a baseline profile measurement.
Fortunately Poly2tri comes with a testbed with sample data that can be used to test the output of the library. The test application accepts a data file containing a list of point as input, runs the triangulation routine and displays the result using OpenGL.
The part I was interested in optimizing was the triangulation routine, so everything else that the test application did was noise that I had to minimize in order to get better readings while profiling.
To do so I modified the
main.cc and did the following:
- Removed the OpenGL visualization part
- Isolated all library related function calls and placed them in a loop to reduce the noise from the startup of the testbed application
The relevant part of the refactored testbed looks like this:
Note: The snippet above contained a call to
delete in the initial version of the post. Creating the CDT on the stack saves the call to
malloc and shaves off a little time from the benchmarked times.
Bird’s eye view with
With Poly2tri ready to be profiled I jumped to my trusty perf, the profiling swiss army knife every Linux dev should have in their toolbox. On Windows, the best solution in probably Intel’s VTune, but I don’t have any experience with it.
perf subcommand I always start with is stat:
taskset 2 perf stat --detailed --repeat 10 ./main testbed/data/debug2.dat 0 0 1
Several things are happening in this one-liner:
taskset 2will force the perf execution to happen on the second CPU core and avoid any random core migration that may muddy up the results. This can be safely done when dealing with single-threaded code, like it’s the case for Poly2tri.
perf statcommand runs the Poly2tri executable (
./main) passing it
debug2.dat, a test data file containing 10000 points to triangulate.
--detailedflag will cause perf to output more counters.
--repeat 10flag reruns the executable ten times and averages the results as well as printing the standard deviation between each run.
0 0 1at the end are arguments that Poly2tri needs for the OpenGL visualization, but are not used in the modified testbed.
This is the result with the interesting parts highlighted:
The first highlighted row shows the execution time in milliseconds. The rightmost column is the standard deviation between runs. A low standard deviation means the program is stable across runs.
The second highlighted row shows the instructions per cycle, an important metric that can give you hints on whether the CPU is stalling during execution.
The Intel® 64 and IA-32 Architectures Optimization Reference Manual defines this metric as Clocks per Instructions Retired Ratio (CPI), and uses the inverse of what
perfshows. The theoretical best on modern superscalar CPUs executing at full speed is 4 instructions per cycle. In the case of Poly2tri, 1.29 instructions per cycle indicate possible front-end or back-end stalls. A deeper analysis is required. Take a look at Tuning Applications Using a Top-down Microarchitecture Analysis Method for more on the subject.
The third highlighted row shows the number of L1 cache misses. Perf highlighted the percentage in yellow to indicate that this may be a potential hotspot and should be investigated further.
Finding the hot spots with
Now that I had a baseline to work with, I used the next
perf subcommand: record.
taskset 2 perf record -e cpu-clock,L1-dcache-load-misses ./main testbed/data/debug2.dat 0 0 1
The previous command will sample the execution of the program and read the event counters specified by the
cpu-clock: measures the time spent in a function. This is useful for gauging the amount the functions we should be focusing on.
L1-dcache-load-misses: will show the functions that triggered the most L1 cache misses
The recorded data can be viewed with
perf report and looks like this:
The first set represents the top 5 functions sorted by
cpu-clock with the percentage column on the left that indicates the amount of time spent overall. The second set of functions represent where L1 cache misses happened.
The functions of interest that appear in both sets are:
p2t::Triangle::MarkNeighbor: (see on Github) marks a triangle as a neighbor if they share two points along an edge.
__ieee754_atan2_avx: (see on Github) is the
libmimplementation of the atan2 function.
p2t::AdvancingFront::LocateNode: (see on Github) traverses the doubly-linked-list of the advancing front looking for the next node to process.
Digging deeper with
Now that we have a few hot spots to look at, we can fire up another perf subcommand: annotate.
Actually, instead of calling annotate directly, I find it easier to jump to it from
perf report by selecting the function of interest and pressing
Cycling through the hottest instructions in the report for
p2t::Triangle::MarkNeighbor (by pressing
TAB), most of hot spots are caused by this:
It corresponds to the function
p2t::Triangle::Contains (see on Github) that is inlined by the optimizer.
The reason the
cmp %rcx,%rdx takes up so many cycles is probably because of a CPU stall caused by either a cache miss or another dependent instruction.
Looking the report for
L1-dcache-load-misses seems to confirm the suspicion:
The other function on the hot spot list,
p2t::AdvancingFront::LocateNode, also suffers from a similar issue: the CPU is having to wait for memory access because of heavy usage of doubly linked lists.
Ways to optimize these sorts of CPU stalls involve improving the cache locality of the data (i.e. keep the data that needs to be worked on closer together in memory). This can’t always be done directly if the underlying algorithm has certain constraints, but often times there is a better way.
An excellent video on the subject is Mike Acton’s great Data-Oriented Design and C++ talk.
Time to Optimize
Before I went about focusing on optimizing cache locality, I wanted to go for the low hanging fruits.
For the first optimization pass I decided to remove all dynamic allocations caused by calls to
new. I never use dynamic allocation in performance critical code, and instead use many types of custom allocators depending on the needs.
I replaced these allocations with a simple fixed-size push allocator that partitions a large memory block that is passed in to the library during initialization. The push allocator looks something like this:
As you can see, it’s very simple and is lightning fast compared to
This optimization resulted in a ~39% performance increase from the baseline on the
debug2.dat file and is proportional to the number of points the library has to process: the more points in the data set there are, the slower dynamic allocations will become.
Doubles to floats
The next optimization I had on my checklist was changing the type for floating point from
My assumption was that since the precision of 64bit floating point numbers was not needed for this type of library, why not use
floats and allow the CPU potentially increase bandwidth and parallelization.
The result was a tiny improvement in performance, mostly because the algorithm doesn’t do any wide operations of floating point numbers that would benefit from the increased bandwidth.
Faster Atan2 function
As seen above, the atan2 is one of the hot spots of Poly2tri, so I tried to replace it with an approximate version of the function with a good enough error.
The implementation I ended up with is based on this fast atan2 approximation and has a maximum error of 0.005 radians. That is perfectly acceptable for my needs.
Optimizing Cache Misses
The last item on my list was cache misses.
While working on the previous optimizations I gained a better understanding on the algorithm Poly2tri is aimed at implementing. The heavy usage of linked lists and double pointers was a sensible decision by the original implementer, and honestly I can’t really thing of a more performant way of doing things.
In the end I tried my best to reduce cache pressure with micro optimizations here and there. The final result is acceptable, but code in some sections may have become more cryptic for people trying to understand the algorithm.
This is what I did:
- Replaced the
std::vectorto the Edges in
pt2::Pointto a linked list. The edges initialized are once and not accessed often, so this change allowed for easier memory management.
- Replaced the
pt2::Triangle(see on Github) with an unsigned int field that is used as a bit mask. The advantage is that bit operations are very cheap and avoid may of the branches used in the original implementation.
- Removed many of the branches that caused cache misses like the one showcased earlier.
- Used a global lookup table for faster access of triangle array indices when rotating left or right from a point.
The final, optimized, result on the
debug2.dat sample data with 100 iterations resulted in ~100% speed increase: