Fun with HdrHistogram
A colleague recently schooled me on performance testing with some links to posts and videos about Gil Tene’s “coordinated omission” problem. I started incorporating the HdrHistogram library into my analysis of performance results and while there are some helpful intros into how to use this library and visualize the results, I didn’t find a detailed explanation into what the raw data was so I’ve added a note here.
Create the Histogram Class
Which instance you pick and the values you pass to it depend on your needs. In my case, I have a few threads running concurrently so I used the atomic version to avoid issues in recording samples concurrently.
private final Histogram histogram = new AtomicHistogram(TimeUnit.MINUTES.toNanos(5), 3);
When creating the instance, you choose the range that you want to support and the level of precision. The samples I’m recording should be in the 1-2 second range on average with outliers at 5x to 10x maximum so I pass in a cap of 5 minutes as the maximum value we’ll record.
Record your samples
Here’s a snippet on recording the sample. I’m using a try-with-resources block to record the sample’s start and stop times.
try (AutoCloseable ignored = wrap(histogram)) {
// ... insert operation here you want to measure
}
private AutoCloseable wrap(Histogram h) {
return new AutoCloseable() {
long start = System.nanoTime();
@Override
public void close() throws Exception {
long end = System.nanoTime();
h.recordValue(end - start);
}
};
}
I noticed that IntelliJ won’t yell at me about the unused AutoCloseable instance if I name it “ignored”.
Output the HDR file
When you’re all done with your tests, you can output the data you’ve collected to a file like so:
private static void emit(File resultsFile, Histogram results) throws IOException {
try (FileOutputStream fout = new FileOutputStream(resultsFile)) {
results.outputPercentileDistribution(new PrintStream(fout), 1000.0);
}
}
I’m passing the 1,000 as the scaling value in order to report all of the results in microseconds. Recall that they’re captured in nanoseconds.
Review of the HdrHistogram Output
The example below is the output from 1,000 iterations of a test.
Value Percentile TotalCount 1/(1-Percentile)
85458.943 0.000000000000 1 1.00
101449.727 0.100000000000 100 1.11
108986.367 0.200000000000 201 1.25
114556.927 0.300000000000 300 1.43
121831.423 0.400000000000 402 1.67
129826.815 0.500000000000 500 2.00
135528.447 0.550000000000 550 2.22
139722.751 0.600000000000 601 2.50
144965.631 0.650000000000 651 2.86
154927.103 0.700000000000 700 3.33
161087.487 0.750000000000 750 4.00
163708.927 0.775000000000 775 4.44
167903.231 0.800000000000 801 5.00
171442.175 0.825000000000 825 5.71
176422.911 0.850000000000 850 6.67
182845.439 0.875000000000 875 8.00
186122.239 0.887500000000 889 8.89
191758.335 0.900000000000 900 10.00
196607.999 0.912500000000 913 11.43
202244.095 0.925000000000 925 13.33
212729.855 0.937500000000 938 16.00
217841.663 0.943750000000 945 17.78
227016.703 0.950000000000 950 20.00
246415.359 0.956250000000 957 22.86
270794.751 0.962500000000 963 26.67
290979.839 0.968750000000 969 32.00
313262.079 0.971875000000 972 35.56
355205.119 0.975000000000 975 40.00
414449.663 0.978125000000 979 45.71
423100.415 0.981250000000 982 53.33
434896.895 0.984375000000 985 64.00
440926.207 0.985937500000 986 71.11
467664.895 0.987500000000 988 80.00
503316.479 0.989062500000 990 91.43
511442.943 0.990625000000 991 106.67
3095396.351 0.992187500000 993 128.00
3095396.351 0.992968750000 993 142.22
3110076.415 0.993750000000 994 160.00
3147825.151 0.994531250000 995 182.86
3152019.455 0.995312500000 996 213.33
3170893.823 0.996093750000 998 256.00
3170893.823 0.996484375000 998 284.44
3170893.823 0.996875000000 998 320.00
3170893.823 0.997265625000 998 365.71
3170893.823 0.997656250000 998 426.67
3179282.431 0.998046875000 999 512.00
3179282.431 0.998242187500 999 568.89
3179282.431 0.998437500000 999 640.00
3179282.431 0.998632812500 999 731.43
3179282.431 0.998828125000 999 853.33
3193962.495 0.999023437500 1000 1024.00
3193962.495 1.000000000000 1000
#[Mean = 167488.553, StdDeviation = 273423.746]
#[Max = 3193962.495, Total count = 1000]
#[Buckets = 29, SubBuckets = 2048]
It was important for me to understand each of these columns and how they were used to produce the plot.
Value
The cumulative bucket for the recorded latency. This units in this case are microseconds because that’s what I selected for the scaling value when generating the output. The range of the bucket can be inferred from its position in the table. The first row includes all entries from 0 to the specified value. The next row includes all samples from the previous row(s) up to the given value.
Again, it’s a cumulative histogram. This differs from what you’ve probably seen before where the histogram is showing the distribution of values and each bucket is rendered as a vertical bar.
Notice that every 5 rows in the table moves you 50% closer to the 100% mark. In this way, we’re inching up to the max value in order to get more and more precision in the results. This is configurable when you export the data to a file with the default being 5.
Percentile
The quantile representing the latency distribution. An entry with 0.30 means that the given measurement includes 30% of the observed samples. In the example below, 30% of the samples had a latency of 114 milliseconds or less.
Total Count
The number of samples in the given bucket. Keep in mind that the bucket actually includes all of the previous rows so this number should increase with each row up to the total number of samples recorded.
1/(1-Percentile)
This value is derived from the Percentile column using the formula. It’s used to convert the percentile into a value that can be plotted. Each row in the table below has an increasing level of precision for the samples. As you progress, this precision increases exponentially.
Plotting the Results
There’s a great online tool for plotting your results using Google’s charting API. It reads the output file natively and renders a chart in HTML. Simply visit the web page, upload the file, and you have the results.
I’ve also incorporated these plots into my build results but that’s another post.
Y Axis
Recall that the samples were scaled to microseconds so you should label the Y Axis as such (there’s a dropdown control on the page to specify the units). The value for the Y axis is the value from the Value column in the histogram data file.
X Axis
The X Axis is the percentile for the test. The actual value used for the X is the 1/(1-Percentile) column which is a nice way of converting small differences in the increasing precision into something that can be plotted.
Conclusion
Read through the links above to Gil’s articles and videos, it’s great material. The key thing for me was to understand the form of the data being recorded. Hopefully you have a better understanding of it now as well.