Benched

We've all heard these being thrown around in an abundance of algorithmic arguments:

Arrays are faster than Lists!
Collection x is better than collection y!
Loops are faster than Streams!
Iterating over 2d arrays is faster in row major order!
Bogosort is faster than Quicksort!

Ok, no one has ever said that last line unironically (I hope), but have you ever wanted a way to prove - or be proven wrong about - the other statements?

Constantly being challenged on, and asking questions about, algorithmic performance led me to start this project. I wanted a definitive way to test the theoretical knowledge I learned in various classes, and to solidify and explain the phenomena I observed in my coding adventures. I've also changed hardware multiple times throughout the life of this project, and it's always cool seeing little leaps in performance when running the same benchmark on better hardware.

There are too many benchmarks to outline them all, so I'll just point to a couple that I think really highlight something beautiful in computers, from physical CPU design all the way up to abstract API calls in the language.

0. Contiguous memory and CPU pre-fetching

The principle of sequential locality refers to data that is arranged adjacently in memory, which allows the CPU to cache it efficiently as it is accessed. Arrays in C-family languages (e.g., Java) are laid out in row-major order: the elements of each row sit next to each other in memory.
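To make the two access patterns concrete, here's a minimal sketch (my own illustration, not necessarily the repository's exact benchmark code) where the only difference is the order of the loop indices:

    // Row-major: the inner loop walks adjacent elements of one row,
    // so the prefetcher can stream them into cache ahead of time.
    static long rowMajorSum(int[][] matrix) {
        long sum = 0;
        for (int row = 0; row < matrix.length; row++) {
            for (int col = 0; col < matrix[row].length; col++) {
                sum += matrix[row][col];
            }
        }
        return sum;
    }

    // Column-major: each step of the inner loop jumps to a different
    // row array, defeating the prefetcher once the matrix outgrows cache.
    // Assumes a square matrix.
    static long columnMajorSum(int[][] matrix) {
        long sum = 0;
        for (int col = 0; col < matrix.length; col++) {
            for (int row = 0; row < matrix.length; row++) {
                sum += matrix[row][col];
            }
        }
        return sum;
    }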

Looking at the data, it's obvious that the row-major operations execute significantly faster than their column-major counterparts.

| Benchmark | (N) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| columnMajorSum | 10 | avgt | 5 | 0.039 | ± 0.001 | us/op |
| rowMajorSum | 10 | avgt | 5 | 0.022 | ± 0.001 | us/op |
| columnMajorSum | 100 | avgt | 5 | 3.154 | ± 0.012 | us/op |
| rowMajorSum | 100 | avgt | 5 | 2.070 | ± 0.005 | us/op |
| columnMajorSum | 1000 | avgt | 5 | 958.767 | ± 7.110 | us/op |
| rowMajorSum | 1000 | avgt | 5 | 208.441 | ± 0.390 | us/op |
| columnMajorSum | 10000 | avgt | 5 | 596847.942 | ± 9710.131 | us/op |
| rowMajorSum | 10000 | avgt | 5 | 31717.749 | ± 899.836 | us/op |

Notice that at N=10000 the row-major version is almost 20x faster than the column-major one, while the discrepancy is smaller at lower N. The discrepancy exists at all because the CPU can pre-fetch the array elements: sequential row-major accesses are all cache hits, while column-major accesses are cache misses. The gap then jumps to roughly 20x at N=10000 because the matrix no longer fits in cache, so the CPU must frequently go out to RAM for the desired memory, drastically slowing it down.

This particular benchmark was also run with various CPU and RAM speed configurations, with the intent of showing how the speed and size of the cache and RAM affect this very memory-sensitive measurement.

1. Number of digits in a number

There are hundreds of ways to solve this problem. I've chosen the three below (sketched as standalone methods right after the list):

  1. Convert to a string then take its length
String.valueOf(num).length()
  1. Use Math to figure out the exponent, which is the number of digits
num == 0 ? 1 : (1 + (int) Math.floor(Math.log10(Math.abs(num))))
  1. Binary search the number of digits
num < 100000 ? num < 100 ? num < 10 ? 1 : 2 : num < 1000 ? 3 : num < 10000 ? 4 : 5 : num < 10000000 ? num < 1000000 ? 6 : 7 : num < 100000000 ? 8 : num < 1000000000 ? 9 : 10
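Wrapped up as standalone methods, the three solutions might look like this (a sketch with method names of my own choosing; the binary search variant assumes a non-negative int):

    // 1. O(d) in the digit count d: builds a whole new String just to measure it
    static int digitsViaString(int num) {
        return String.valueOf(num).length(); // note: counts the '-' sign too
    }

    // 2. Constant-time floating point math: digits = floor(log10(|num|)) + 1
    static int digitsViaLog(int num) {
        return num == 0 ? 1 : 1 + (int) Math.floor(Math.log10(Math.abs(num)));
    }

    // 3. Binary search over the 10 possible digit counts of an int:
    //    at most ~4 integer comparisons, no allocation, no floating point
    static int digitsViaIfNest(int num) {
        return num < 100000
                ? num < 100
                    ? num < 10 ? 1 : 2
                    : num < 1000 ? 3 : num < 10000 ? 4 : 5
                : num < 10000000
                    ? num < 1000000 ? 6 : 7
                    : num < 100000000 ? 8 : num < 1000000000 ? 9 : 10;
    }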

To illustrate how the speed of each solution varies with input, each solution was run with a few inputs:

  1. int_5, a 5-digit int
  2. int_9, a 9-digit int
  3. double_5, a 5-digit double
  4. double_300, a 300-digit double

| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| double300Log | avgt | 5 | 7.679 | ± 0.040 | ns/op |
| double300StringLength | avgt | 5 | 627.809 | ± 4.211 | ns/op |
| double5Log | avgt | 5 | 7.700 | ± 0.073 | ns/op |
| double5StringLength | avgt | 5 | 63.583 | ± 0.843 | ns/op |
| int5IfNest | avgt | 5 | 1.560 | ± 0.011 | ns/op |
| int5Log | avgt | 5 | 8.026 | ± 0.039 | ns/op |
| int5StringLength | avgt | 5 | 7.308 | ± 0.039 | ns/op |
| int9IfNest | avgt | 5 | 1.560 | ± 0.024 | ns/op |
| int9Log | avgt | 5 | 8.031 | ± 0.048 | ns/op |
| int9StringLength | avgt | 5 | 10.621 | ± 0.041 | ns/op |

Everyone is taught in school that String manipulation is slower than math (for equivalent operations), but that's not always the case. We can see that for integers, the String method is about as fast as the log method. Math functions can be more expensive than you think.

And, as a surprise to some, the binary search method was the fastest by far, across the board wherever it was applicable. This solution uses only a handful of integer comparisons (a binary search over the ten possible digit counts), which is far cheaper than an O(n) String creation (n being the number of digits) or the overhead of the math function.

Another point of interest is that the log solution's execution time does not scale with the size of the number, remaining around 8 nanoseconds for numbers ranging from 5 to 300 digits. In contrast, the String solution did scale with the size, as we would expect.

2. String concatenation

If you concatenate Strings in a loop like this:

String acc = "";
for (var x : collection) {
    acc += x;
}

I've got some big news for you.

Strings in Java are immutable. This means that they cannot be modified after creation. Here's String#substring() from the JDK, showing that a new instance of String is created (instead of a view of the current String).

    // java.lang.String
    public String substring(int beginIndex, int endIndex) {
        int length = length();
        checkBoundsBeginEnd(beginIndex, endIndex, length);
        if (beginIndex == 0 && endIndex == length) {
            return this;
        }
        int subLen = endIndex - beginIndex;
        return isLatin1() ? StringLatin1.newString(value, beginIndex, subLen)
                          : StringUTF16.newString(value, beginIndex, subLen);
    }

This means that mutating a String in any way (append, prepend, change a character) involves making a completely new String with the modification applied. Thus, any String mutation operation (such as concatenation in a loop) is incredibly slow, as new memory must be allocated, and the old copies garbage-collected, on every iteration.
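This is also why the loop at the top of this section is quadratic. Each acc += x conceptually desugars to something like the following (a simplification; since JDK 9, JEP 280 wires the concatenation through invokedynamic instead, but a fresh String is still materialized on every iteration):

    String acc = "";
    for (var x : collection) {
        // acc += x behaves roughly like:
        acc = new StringBuilder()
                .append(acc)  // copies everything accumulated so far
                .append(x)
                .toString();  // copies it all again into a brand-new String
    }
    // total characters copied: 1 + 2 + ... + n, i.e. O(n^2)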

I tested 6 implementations:

  1. The "classic" appending to a String in a loop.
    @Benchmark
    public void stringConcat(Blackhole bh) {
        String res = "";
        for (int i = 0; i < N; i++) {
            res += unit;
        }
        bh.consume(res);
    }
  1. Using StringBuffer, as the javadocs for the String class mention.
    @Benchmark
    public void stringBuffer(Blackhole bh) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < N; i++) {
            sb.append(unit);
        }
        bh.consume(sb);
    }
  1. Using StringBuilder
    @Benchmark
    public void stringBuilder(Blackhole bh) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < N; i++) {
            sb.append(unit);
        }
        bh.consume(sb);
    }
  1. Cheating with String#repeat() to show the power of pre-allocation
    @Benchmark
    public void stringRepeat(Blackhole bh) {
        bh.consume(unit.repeat(N));
    }
  1. Tossing in a Stream.reduce() for fun
    @Benchmark
    public void streamReduce(Blackhole bh) {
        bh.consume(IntStream.range(0, N)
        .mapToObj(i -> unit)
        .reduce(String::concat));
    }
  1. ... then making it parallel
    @Benchmark
    public void parallelStreamReduce(Blackhole bh) {
        bh.consume(IntStream.range(0, N)
        .parallel() // the magic sauce
        .mapToObj(i -> unit)
        .reduce(String::concat));
    }

Let's look at the results, then break it down.

| Benchmark | (N) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| stringConcat | 100000 | avgt | 5 | 8616.315 | ± 235.166 | ms/op |
| stringBuffer | 100000 | avgt | 5 | 1.232 | ± 0.030 | ms/op |
| stringBuilder | 100000 | avgt | 5 | 0.950 | ± 0.048 | ms/op |
| stringRepeat | 100000 | avgt | 5 | 0.163 | ± 0.004 | ms/op |
| streamReduce | 100000 | avgt | 5 | 6599.168 | ± 249.764 | ms/op |
| parallelStreamReduce | 100000 | avgt | 5 | 112.556 | ± 5.038 | ms/op |

First of all: the naive String concatenation in a loop is the worst performer. As mentioned, this is due to Strings being immutable, so each iteration allocates a brand-new String and throws the old one away.

StringBuffer and StringBuilder perform similarly, with StringBuilder pulling slightly ahead. This is due to StringBuffer being synchronized, which adds a small locking overhead, and that overhead is exactly what we've measured here.

String#repeat() is a method that was introduced in Java 11. As you would expect, this method calculates the final size of the String ahead of time, and allocates the entire memory together. Then, it uses fast native calls to copy/duplicate the data, with a clever loop to copy at an exponential rate.

    // java.lang.String
    public String repeat(int count) {
        // skipping some base cases and other fluff here ...
        final int limit = len * count;
        final byte[] multiple = new byte[limit]; // allocate in one go
        System.arraycopy(value, 0, multiple, 0, len);
        int copied = len;
        // the <<= 1 call allows for log(n) iterations
        for (; copied < limit - copied; copied <<= 1) {
            System.arraycopy(multiple, 0, multiple, copied, copied);
        }
        System.arraycopy(multiple, 0, multiple, copied, limit - copied);
        return new String(multiple, coder);
    }

As expected, String#repeat() is by far the fastest, outperforming even StringBuilder by nearly a factor of 6. However, it's worth noting that this method only works for repeating a single String, while the other approaches can append arbitrary content.
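If you know (or can estimate) the final length up front, you can recover some of that pre-allocation advantage for arbitrary content by pre-sizing the StringBuilder. A sketch of such a variant (not one of the six measured above; it assumes the same N and unit fields as the other benchmarks):

    @Benchmark
    public void stringBuilderPresized(Blackhole bh) {
        // reserving the full capacity up front skips the intermediate
        // grow-and-copy steps of the default 16-char buffer
        StringBuilder sb = new StringBuilder(N * unit.length());
        for (int i = 0; i < N; i++) {
            sb.append(unit);
        }
        bh.consume(sb);
    }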

The sequential Stream performed slowly, though still faster than the naive iterative concatenation. The parallel implementation achieved roughly a 60x speedup over the sequential stream (each worker reduces a smaller chunk, so far fewer characters are copied in total), but it remains over 100x slower than the StringBuilder.

Takeaway: don't naively concatenate Strings in a loop.

3. How fast can you deal your Deque?

Let's take a little break and do a quick one.

ArrayDeque is faster than LinkedList.

I ran a lot of benchmarks for this comparison, but let's just take a look at a curated list.

| Benchmark | (N) | Mode | Cnt | Score (us/op) | Error (us/op) |
|---|---|---|---|---|---|
| arrayDeque init | 10 | avgt | 5 | 1.101 | ± 0.032 |
| linkedList init | 10 | avgt | 5 | 1.079 | ± 0.013 |
| arrayDeque init | 100000000 | avgt | 5 | 682196.480 | ± 1688984.231 |
| linkedList init | 100000000 | avgt | 5 | 7181044.910 | ± 10077727.767 |
| arrayDeque traverse | 10 | avgt | 5 | 1.076 | ± 0.038 |
| linkedList traverse | 10 | avgt | 5 | 1.071 | ± 0.002 |
| arrayDeque traverse | 100000000 | avgt | 5 | 433594.004 | ± 99847.736 |
| linkedList traverse | 100000000 | avgt | 5 | 1354218.609 | ± 957037.882 |
| arrayDeque removeAll | 10 | avgt | 5 | 1.065 | ± 0.002 |
| linkedList removeAll | 10 | avgt | 5 | 1.072 | ± 0.008 |
| arrayDeque removeAll | 100000000 | avgt | 5 | 449730.860 | ± 547815.400 |
| linkedList removeAll | 100000000 | avgt | 5 | 1701792.392 | ± 1759202.243 |

From these results, I can safely recommend ArrayDeque over LinkedList to anyone looking for a Queue or Stack data structure. ArrayDeque is backed by an array, while LinkedList is backed by nodes that must be allocated (and later garbage-collected) for every element. Even accounting for ArrayDeque's need to grow and shrink its backing array, the amortized cost is low enough that its overall performance is substantially better than LinkedList's.
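For reference, the traversal benchmarks were shaped roughly like the sketch below (my reconstruction rather than the repository's exact code; arrayDeque and linkedList are assumed to be pre-filled fields of size N, and the init/removeAll variants follow the same pattern):

    @Benchmark
    public void arrayDequeTraverse(Blackhole bh) {
        // walks a contiguous backing array: cache-friendly
        for (int value : arrayDeque) {
            bh.consume(value);
        }
    }

    @Benchmark
    public void linkedListTraverse(Blackhole bh) {
        // chases node pointers scattered across the heap: cache-hostile
        for (int value : linkedList) {
            bh.consume(value);
        }
    }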

4. How fast can you sum a list of numbers?

For the finale, I'll present the answer to a question my friends were asking:

How fast can you sum a list of numbers? Let's say we're working with an ArrayList of type Double.

Let's outline the solutions:

  1. The "classic" adding to an accumulator in a loop. This is what your typical student will come up with.
    @Benchmark
    public void forLoop(Blackhole bh) {
        double sum = 0;
        for (double num : list) {
            sum += num;
        }
        bh.consume(sum);
    }
  1. DoubleAdder is a class that seems to be made for this. Let's see how it performs.
    @Benchmark
    public void doubleAdder(Blackhole bh) {
        DoubleAdder a = new DoubleAdder();
        list.parallelStream().forEach(a::add);
        bh.consume(a.doubleValue());
    }
  1. Mapping to a double then summing
    @Benchmark
    public void mapToDoubleSum(Blackhole bh) {
        double sum = list.stream()
                         .mapToDouble(Double::doubleValue)
                         .sum();
        bh.consume(sum);
    }

An equivalent variation was also tested, using the primitive identity map instead:

    @Benchmark
    public void mapToDoubleSum_prim(Blackhole bh) {
        double sum = list.stream()
                         .mapToDouble(i -> i)
                         .sum();
        bh.consume(sum);
    }

These two performed identically, so only one will be represented in the final table.

  1. Mapping to a double then summing, but parallelized. Similar to the sequential stream, an equivalent variation using the primitive identity map performed identically so I won't show it here.
    @Benchmark
    public void mapToDoubleSumParallel(Blackhole bh) {
        double sum = list.parallelStream()
                         .mapToDouble(Double::doubleValue)
                         .sum();
        bh.consume(sum);
    }
  1. Reducing a stream by Double's summing function
    @Benchmark
    public void reduceSumDouble(Blackhole bh) {
        double sum = list.stream()
                         .reduce(0.0, Double::sum);
        bh.consume(sum);
    }
  1. ... but parallelized
    @Benchmark
    public void reduceSumDoubleParallel(Blackhole bh) {
        double sum = list.parallelStream()
                         .reduce(0.0, Double::sum);
        bh.consume(sum);
    }
  1. Using Collectors.summingDouble()
    @Benchmark
    public void summingCollectorDouble(Blackhole bh) {
        // if not for the noinspection comment, IntelliJ IDEA would simplify this call to 
        // double sum = list.stream().mapToDouble(i -> i).sum();
        //noinspection SimplifyStreamApiCallChains
        double sum = list.stream()
                         .collect(Collectors.summingDouble(i -> i));
        bh.consume(sum);
    }
  1. ... but parallelized
    @Benchmark
    public void summingCollectorDoubleParallel(Blackhole bh) {
        //noinspection SimplifyStreamApiCallChains
        double sum = list.parallelStream()
                         .collect(Collectors.summingDouble(i -> i));
        bh.consume(sum);
    }

Let's have a look at a curated set of results.

| Benchmark | (N) | Mode | Cnt | Score (us/op) | Error (us/op) |
|---|---|---|---|---|---|
| forLoop | 10 | avgt | 10 | 0.006 | ± 0.001 |
| doubleAdder | 10 | avgt | 10 | 5.551 | ± 0.034 |
| mapToDoubleSum | 10 | avgt | 10 | 0.057 | ± 0.001 |
| mapToDoubleSumParallel | 10 | avgt | 10 | 5.642 | ± 0.040 |
| reduceSumDouble | 10 | avgt | 10 | 0.054 | ± 0.001 |
| reduceSumDoubleParallel | 10 | avgt | 10 | 5.590 | ± 0.031 |
| summingCollectorDouble | 10 | avgt | 10 | 0.063 | ± 0.001 |
| summingCollectorDoubleParallel | 10 | avgt | 10 | 5.879 | ± 0.013 |
| forLoop | 10000000 | avgt | 10 | 10899.721 | ± 114.039 |
| doubleAdder | 10000000 | avgt | 10 | 7247.295 | ± 20.757 |
| mapToDoubleSum | 10000000 | avgt | 10 | 35498.336 | ± 87.104 |
| mapToDoubleSumParallel | 10000000 | avgt | 10 | 7246.922 | ± 11.496 |
| reduceSumDouble | 10000000 | avgt | 10 | 29835.707 | ± 106.342 |
| reduceSumDoubleParallel | 10000000 | avgt | 10 | 22112.735 | ± 18.310 |
| summingCollectorDouble | 10000000 | avgt | 10 | 45315.357 | ± 375.390 |
| summingCollectorDoubleParallel | 10000000 | avgt | 10 | 7275.261 | ± 93.681 |

That's a lot of data to take in. Let's break it down.

As expected, the for loop performs the fastest when there are only 10 elements. The minimal overhead allows the simple manual loop accumulation to blow (almost) everything out of the water. Surprisingly, Collectors.summingDouble() performs almost as well, despite theoretically carrying some overhead. However, when we increase the element count to 10 million, this implementation performs the worst; I'm not exactly sure what changed to cause this discrepancy, but it's quite curious.

DoubleAdder and the parallel stream implementations had the slowest runtimes for N=10, due to their relatively high overheads of roughly 100x their sequential counterparts. However, the overhead pays off: they post the fastest runtimes at N=1E7, all ending up about 30% faster than even the "overhead-less" for loop, except for the Stream.reduce() solution, which seems to struggle to scale as N increases. This shows the importance of the constant factor in algorithmic analysis. Everyone is caught up with big O notation, but at low N the constant factors dominate the higher-order terms. Only once we get into "asymptotic" territory (around N=1E4 in the broader data) do the constants start falling off compared to the dominant terms.

The overall lesson is that the naive for loop is not a bad choice; for novice programmers, the time saved by not puzzling out the other implementations likely outweighs the difference in runtime. However, the benchmark also shows that there are methods that can beat the simpler solutions, so understanding how things work under the hood and leveraging the hardware can yield faster code.

That was just a selection of my favourite benchmarks.

There are many, many more benchmarks in the repository, and I'm still adding benchmarks that answer questions that piqued my interest. This is an ever-growing set of data that I'll continue to contribute to in the pursuit of scratching that performance itch.

And if you actually read this wall of text to this point, you're probably interested in performance as much as I am. Tag along and feel free to contribute :)