Benched
We've all heard these being thrown around in an abundance of algorithmic arguments:
- Arrays are faster than Lists!
- Collection x is better than collection y!
- Loops are faster than Streams!
- Iterating over 2D arrays is faster in row-major order!
- Bogosort is faster than Quicksort!
Ok, no one has ever said that last line unironically (I hope), but have you ever wanted a way to prove - or be proven wrong about - the other statements?
Constantly being challenged and asking questions regarding algorithmic performance led me to start this project. I wanted a definitive way to test the theoretical knowledge I learned in various classes, and to solidify and explain the phenomena I observed in my coding adventures. I've also changed hardware multiple times throughout the life of this project, and it's always cool seeing little leaps in performance when running the same benchmark on better hardware.
There are too many benchmarks to outline them all, so I'll just point to a couple that I think really highlight something beautiful in computers, from physical CPU design all the way up to abstract API calls in the language.
0. Contiguous memory and CPU pre-fetching
The principle of sequential locality refers to data that is arranged adjacently in memory, which allows the CPU to cache it efficiently when it is accessed in order. Arrays in C-family languages (e.g. Java) are arranged in row-major order.
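To make this concrete, here's a minimal sketch of the two access patterns being compared (the same idea as the benchmark, though not its exact code):
// Row-major: the inner loop walks each contiguous inner array in order,
// so the prefetcher keeps the cache warm.
static long rowMajorSum(int[][] grid) {
    long sum = 0;
    for (int row = 0; row < grid.length; row++) {
        for (int col = 0; col < grid[row].length; col++) {
            sum += grid[row][col];
        }
    }
    return sum;
}

// Column-major: the inner loop hops between inner arrays (assumes a rectangular
// N x N array), touching a different region of memory on nearly every access.
static long columnMajorSum(int[][] grid) {
    long sum = 0;
    for (int col = 0; col < grid[0].length; col++) {
        for (int row = 0; row < grid.length; row++) {
            sum += grid[row][col];
        }
    }
    return sum;
}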
Looking at the data, it's obvious that the row-major operations execute significantly faster than their column-major counterparts.
Benchmark | (N) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|
columnMajorSum | 10 | avgt | 5 | 0.039 | 0.001 | us/op |
rowMajorSum | 10 | avgt | 5 | 0.022 | 0.001 | us/op |
columnMajorSum | 100 | avgt | 5 | 3.154 | 0.012 | us/op |
rowMajorSum | 100 | avgt | 5 | 2.070 | 0.005 | us/op |
columnMajorSum | 1000 | avgt | 5 | 958.767 | 7.110 | us/op |
rowMajorSum | 1000 | avgt | 5 | 208.441 | 0.390 | us/op |
columnMajorSum | 10000 | avgt | 5 | 596847.942 | 9710.131 | us/op |
rowMajorSum | 10000 | avgt | 5 | 31717.749 | 899.836 | us/op |
Notice that with N=10000, the row-major implementation is almost 20x faster than the column-major one, while the discrepancy is smaller for lower N. The discrepancy exists because the CPU can prefetch the array elements, which are cache hits for row-major access while they are largely cache misses for column-major. The gap then jumps to 20x at N=10000 because the array no longer fits in cache, so the CPU must go out to RAM whenever the desired memory isn't cached, drastically slowing it down.
This particular benchmark was also run with various CPU and RAM speed configurations, with the intent of showing how the speed and size of the cache and RAM affect this very memory-sensitive measurement.
1. Number of digits in a number
There are hundreds of ways to solve this problem. I've chosen the three below (a combined runnable sketch follows the list):
- Convert to a string, then take its length
String.valueOf(num).length()
- Use Math to find the base-10 exponent; the number of digits is one more than that
num == 0 ? 1 : (1 + (int) Math.floor(Math.log10(Math.abs(num))))
- Binary search the number of digits
num < 100000 ? num < 100 ? num < 10 ? 1 : 2 : num < 1000 ? 3 : num < 10000 ? 4 : 5 : num < 10000000 ? num < 1000000 ? 6 : 7 : num < 100000000 ? 8 : num < 1000000000 ? 9 : 10
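For readability, here are the same three one-liners wrapped into a small runnable sketch (the method names are mine, not the benchmark's, and the binary-search variant assumes a non-negative int):
public class DigitCount {
    static int viaString(int num) {
        return String.valueOf(num).length();
    }

    static int viaLog(int num) {
        return num == 0 ? 1 : (1 + (int) Math.floor(Math.log10(Math.abs(num))));
    }

    // Same nested ternary as above, just indented to show the binary search structure.
    static int viaBinarySearch(int num) {
        return num < 100000
                ? num < 100
                    ? num < 10 ? 1 : 2
                    : num < 1000 ? 3 : num < 10000 ? 4 : 5
                : num < 10000000
                    ? num < 1000000 ? 6 : 7
                    : num < 100000000 ? 8 : num < 1000000000 ? 9 : 10;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 7, 12345, 123456789}) {
            System.out.printf("%d -> %d %d %d%n", n, viaString(n), viaLog(n), viaBinarySearch(n));
        }
    }
}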
To illustrate how the speed of each solution varies with input, each one was run with a few inputs:
- int_5, a 5-digit int
- int_9, a 9-digit int
- double_5, a 5-digit double
- double_300, a 300-digit double
Benchmark | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|
double300Log | avgt | 5 | 7.679 | 0.040 | ns/op |
double300StringLength | avgt | 5 | 627.809 | 4.211 | ns/op |
double5Log | avgt | 5 | 7.700 | 0.073 | ns/op |
double5StringLength | avgt | 5 | 63.583 | 0.843 | ns/op |
int5IfNest | avgt | 5 | 1.560 | 0.011 | ns/op |
int5Log | avgt | 5 | 8.026 | 0.039 | ns/op |
int5StringLength | avgt | 5 | 7.308 | 0.039 | ns/op |
int9IfNest | avgt | 5 | 1.560 | 0.024 | ns/op |
int9Log | avgt | 5 | 8.031 | 0.048 | ns/op |
int9StringLength | avgt | 5 | 10.621 | 0.041 | ns/op |
Everyone is taught in school that String manipulation is slower than math (for equivalent operations), but that's not always the case. We can see that for integers, the String method is roughly as fast as the log method; math functions can be more expensive than you'd think.
And, as a surprise to some, the binary search method was the fastest by far, across the board wherever it was applicable. It performs only O(log10(n)) integer comparisons, which is far cheaper than an O(n) String creation or the overhead of the math function.
Another point of interest is that the log solution's execution time does not scale with the size of the number, staying around 8 nanoseconds for numbers ranging from 5 to 300 digits. In contrast, the String solution did scale with the size, as we would expect.
2. String concatenation
If you concatenate Strings in a loop like this:
String acc = "";
for (var x : collection) {
acc += x;
}
I've got some big news for you.
Strings in Java are immutable, meaning they cannot be modified after creation. Here's the JDK's implementation of String#substring(), which shows that a new instance of String is created (instead of a view into the current String).
// java.lang.String
public String substring(int beginIndex, int endIndex) {
int length = length();
checkBoundsBeginEnd(beginIndex, endIndex, length);
if (beginIndex == 0 && endIndex == length) {
return this;
}
int subLen = endIndex - beginIndex;
return isLatin1() ? StringLatin1.newString(value, beginIndex, subLen)
: StringUTF16.newString(value, beginIndex, subLen);
}
This means that mutating a String in any way (append, prepend, change character) involves making a completely new String with the modification applied. Thus, any String mutation operation (such as concatenation in a loop) is incredibly slow as new memory must be allocated and destroyed on every iteration.
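Concretely, each pass through the loop above amounts to something like this (a conceptual sketch; the exact code generated for += varies between JDK versions):
String acc = "";
for (var x : collection) {
    // what `acc += x` effectively does each iteration:
    acc = new StringBuilder()
            .append(acc)   // copies every character accumulated so far
            .append(x)
            .toString();   // copies everything again into a brand-new String
    // the previous value of acc is now garbage waiting to be collected
}
That copy-of-everything on every iteration is what turns the whole loop into quadratic work.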
I tested 6 implementations:
- The "classic" appending to a String in a loop.
@Benchmark
public void stringConcat(Blackhole bh) {
String res = "";
for (int i = 0; i < N; i++) {
res += unit;
}
bh.consume(res);
}
- Using StringBuffer, as the javadoc for the String class suggests.
@Benchmark
public void stringBuffer(Blackhole bh) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < N; i++) {
sb.append(unit);
}
bh.consume(sb);
}
- Using StringBuilder
@Benchmark
public void stringBuilder(Blackhole bh) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < N; i++) {
sb.append(unit);
}
bh.consume(sb);
}
- Cheating with String#repeat() to show the power of pre-allocation
@Benchmark
public void stringRepeat(Blackhole bh) {
bh.consume(unit.repeat(N));
}
- Tossing in a Stream.reduce() for fun
@Benchmark
public void streamReduce(Blackhole bh) {
bh.consume(IntStream.range(0, N)
.mapToObj(i -> unit)
.reduce(String::concat));
}
- ... then making it parallel
@Benchmark
public void parallelStreamReduce(Blackhole bh) {
bh.consume(IntStream.range(0, N)
.parallel() // the magic sauce
.mapToObj(i -> unit)
.reduce(String::concat));
}
Let's look at the results, then break it down.
Benchmark | (N) | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|---|
stringConcat | 100000 | avgt | 5 | 8616.315 | 235.166 | ms/op |
stringBuffer | 100000 | avgt | 5 | 1.232 | 0.030 | ms/op |
stringBuilder | 100000 | avgt | 5 | 0.950 | 0.048 | ms/op |
stringRepeat | 100000 | avgt | 5 | 0.163 | 0.004 | ms/op |
streamReduce | 100000 | avgt | 5 | 6599.168 | 249.764 | ms/op |
parallelStreamReduce | 100000 | avgt | 5 | 112.556 | 5.038 | ms/op |
First of all: the naive string concatenation in a loop is by far the worst performer. As mentioned, this is because Strings are immutable, so every iteration allocates a brand-new String and leaves the previous one behind for the garbage collector.
StringBuffer and StringBuilder perform similarly, with StringBuilder pulling ahead slightly. This is because StringBuffer's methods are synchronized, which adds a small amount of overhead, and that overhead is exactly what we've measured here.
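Incidentally, if the final length is known (or can be estimated) up front, you can claim part of the pre-allocation benefit in the general case by pre-sizing the builder; a minimal sketch, not one of the benchmarked variants:
// Hypothetical variant: pre-size the builder so its backing array never has
// to grow and re-copy mid-loop. Assumes N and unit as in the benchmarks above.
StringBuilder sb = new StringBuilder(N * unit.length());
for (int i = 0; i < N; i++) {
    sb.append(unit);
}
String result = sb.toString();
Knowing the final size ahead of time is exactly the trick repeat() exploits.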
String#repeat() was introduced in Java 11. As you would expect, it calculates the final size of the String ahead of time and allocates all of the memory in one go. It then uses fast native array copies to duplicate the data, with a clever loop that copies at an exponential rate.
// java.lang.String
public String repeat(int count) {
// skipping some base cases and other fluff here ...
final int limit = len * count;
final byte[] multiple = new byte[limit]; // allocate in one go
System.arraycopy(value, 0, multiple, 0, len);
int copied = len;
// the <<= 1 call allows for log(n) iterations
for (; copied < limit - copied; copied <<= 1) {
System.arraycopy(multiple, 0, multiple, copied, copied);
}
System.arraycopy(multiple, 0, multiple, copied, limit - copied);
return new String(multiple, coder);
}
As expected, String#repeat() is by far the fastest, outperforming even StringBuilder by nearly a factor of 6. However, it's worth noting that this method only works for repeating a single String, while the other approaches place no restriction on what is being concatenated.
The sequential Stream performed poorly, though still faster than the naive iterative concatenation. The parallel implementation achieved roughly a 60x speedup over the sequential stream, though it's still over 100x slower than the StringBuilder.
Takeaway: don't naively concatenate Strings in a loop.
3. How fast can you deal your Deque?
Let's take a little break and do a quick one.
ArrayDeque is faster than LinkedList.
I ran a lot of benchmarks for this comparison, but let's just take a look at a curated list.
Benchmark | (N) | Mode | Cnt | Score (us/op) | Error (us/op) |
---|---|---|---|---|---|
arrayDeque init | 10 | avgt | 5 | 1.101 | 0.032 |
linkedList init | 10 | avgt | 5 | 1.079 | 0.013 |
arrayDeque init | 100000000 | avgt | 5 | 682196.480 | 1688984.231 |
linkedList init | 100000000 | avgt | 5 | 7181044.910 | 10077727.767 |
arrayDeque traverse | 10 | avgt | 5 | 1.076 | 0.038 |
linkedList traverse | 10 | avgt | 5 | 1.071 | 0.002 |
arrayDeque traverse | 100000000 | avgt | 5 | 433594.004 | 99847.736 |
linkedList traverse | 100000000 | avgt | 5 | 1354218.609 | 957037.882 |
arrayDeque removeAll | 10 | avgt | 5 | 1.065 | 0.002 |
linkedList removeAll | 10 | avgt | 5 | 1.072 | 0.008 |
arrayDeque removeAll | 100000000 | avgt | 5 | 449730.860 | 547815.400 |
linkedList removeAll | 100000000 | avgt | 5 | 1701792.392 | 1759202.243 |
From these results, I can safely recommend ArrayDeque over LinkedList to anyone looking for a Queue or Stack data structure. ArrayDeque is backed by an array, while LinkedList is backed by node objects that must be allocated (and later garbage collected) for every element. Even accounting for ArrayDeque's need to grow and shrink its backing array, the amortized cost is low enough that its overall performance is substantially better than LinkedList's.
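For reference, here's a minimal sketch (not part of the benchmark) of ArrayDeque filling both roles:
import java.util.ArrayDeque;

public class DequeDemo {
    public static void main(String[] args) {
        // As a FIFO queue
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        queue.offer(1);                   // enqueue at the tail
        queue.offer(2);
        System.out.println(queue.poll()); // dequeue from the head -> 1

        // As a LIFO stack (the recommended replacement for java.util.Stack)
        ArrayDeque<Integer> stack = new ArrayDeque<>();
        stack.push(10);                   // push onto the head
        stack.push(20);
        System.out.println(stack.pop());  // pop from the head -> 20
    }
}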
4. How fast can you sum a list of numbers?
For the finale, I'll present the answer to a question my friends were asking: how fast can you sum a list of numbers? Let's say we're working with an ArrayList of type Double.
Let's outline the solutions:
- The "classic" adding to an accumulator in a loop. This is what your typical student will come up with.
@Benchmark
public void forLoop(Blackhole bh) {
double sum = 0;
for (double num : list) {
sum += num;
}
bh.consume(sum);
}
- DoubleAdder is a class that seems to be made for this. Let's see how it performs.
@Benchmark
public void doubleAdder(Blackhole bh) {
DoubleAdder a = new DoubleAdder();
list.parallelStream().forEach(a::add);
bh.consume(a.doubleValue());
}
- Mapping to a double then summing
@Benchmark
public void mapToDoubleSum(Blackhole bh) {
double sum = list.stream()
.mapToDouble(Double::doubleValue)
.sum();
bh.consume(sum);
}
Another equivalent variation was tested, which instead uses the primitive identity map:
@Benchmark
public void mapToDoubleSum_prim(Blackhole bh) {
double sum = list.stream()
.mapToDouble(i -> i)
.sum();
bh.consume(sum);
}
These two performed identically, so only one will be represented in the final table.
- Mapping to a double then summing, but parallelized. As with the sequential stream, an equivalent variation using the primitive identity map performed identically, so I won't show it here.
@Benchmark
public void mapToDoubleSumParallel(Blackhole bh) {
double sum = list.parallelStream()
.mapToDouble(Double::doubleValue)
.sum();
bh.consume(sum);
}
- Reducing a stream by Double's summing function
@Benchmark
public void reduceSumDouble(Blackhole bh) {
double sum = list.stream()
.reduce(0.0, Double::sum);
bh.consume(sum);
}
- ... but parallelized
@Benchmark
public void reduceSumDoubleParallel(Blackhole bh) {
double sum = list.parallelStream()
.reduce(0.0, Double::sum);
bh.consume(sum);
}
- Using Collectors.summingDouble()
@Benchmark
public void summingCollectorDouble(Blackhole bh) {
// if not for the noinspection comment, IntelliJ IDEA would simplify this call to
// double sum = list.stream().mapToDouble(i -> i).sum();
//noinspection SimplifyStreamApiCallChains
double sum = list.stream()
.collect(Collectors.summingDouble(i -> i));
bh.consume(sum);
}
- ... but parallelized
@Benchmark
public void summingCollectorDoubleParallel(Blackhole bh) {
//noinspection SimplifyStreamApiCallChains
double sum = list.parallelStream()
.collect(Collectors.summingDouble(i -> i));
bh.consume(sum);
}
Let's have a look at a curated set of results.
Benchmark | (N) | Mode | Cnt | Score (us/op) | Error (us/op) |
---|---|---|---|---|---|
forLoop | 10 | avgt | 10 | 0.006 | 0.001 |
doubleAdder | 10 | avgt | 10 | 5.551 | 0.034 |
mapToDoubleSum | 10 | avgt | 10 | 0.057 | 0.001 |
mapToDoubleSumParallel | 10 | avgt | 10 | 5.642 | 0.040 |
reduceSumDouble | 10 | avgt | 10 | 0.054 | 0.001 |
reduceSumDoubleParallel | 10 | avgt | 10 | 5.590 | 0.031 |
summingCollectorDouble | 10 | avgt | 10 | 0.063 | 0.001 |
summingCollectorDoubleParallel | 10 | avgt | 10 | 5.879 | 0.013 |
forLoop | 10000000 | avgt | 10 | 10899.721 | 114.039 |
doubleAdder | 10000000 | avgt | 10 | 7247.295 | 20.757 |
mapToDoubleSum | 10000000 | avgt | 10 | 35498.336 | 87.104 |
mapToDoubleSumParallel | 10000000 | avgt | 10 | 7246.922 | 11.496 |
reduceSumDouble | 10000000 | avgt | 10 | 29835.707 | 106.342 |
reduceSumDoubleParallel | 10000000 | avgt | 10 | 22112.735 | 18.310 |
summingCollectorDouble | 10000000 | avgt | 10 | 45315.357 | 375.390 |
summingCollectorDoubleParallel | 10000000 | avgt | 10 | 7275.261 | 93.681 |
That's a lot of data to take in. Let's break it down.
As expected, the for loop is the fastest when there are only 10 elements; its minimal overhead lets the simple manual accumulation blow (almost) everything out of the water. Somewhat surprisingly, Collectors.summingDouble() performs about as well as the other sequential stream solutions despite theoretically carrying some extra overhead. However, when the element count grows to 10 million, the collector-based implementation performs the worst of all; I'm not exactly sure what changed to cause this discrepancy, but it's quite curious.
DoubleAdder and the parallel stream implementations had the slowest runtimes for N=10, due to a fixed overhead roughly 100x the cost of their sequential counterparts. That overhead pays off at N=1E7, though, where they come in about 30% faster than even the "overhead-less" for loop, with the exception of the parallel Stream.reduce() solution, which seems to struggle to scale as N increases. This shows the importance of the constant factor in algorithmic analysis: everyone gets caught up in big-O notation, but at low N the constant factors dominate the higher-order terms, and only well into "asymptotic" territory do those constants fade next to the dominant terms.
The overall lesson is that the naive for loop is not a bad choice; for novice programmers, the time saved by not puzzling out the other implementations likely outweighs the difference in runtime. However, the benchmark also shows that there are approaches that beat the simple solution, so understanding how things work under the hood and leveraging the hardware can result in faster code.
That was just a selection of my favourite benchmarks.
There are many, many more benchmarks in the repository, and I'm still adding ones that answer questions that pique my interest. This is an ever-growing set of data that I'll continue to contribute to in the pursuit of scratching that performance itch.
And if you actually read this wall of text to this point, you're probably interested in performance as much as I am. Tag along and feel free to contribute :)