Zach Bjornson — GeistHaus

Fast, accurate summation of floating-point numbers

Aug 11, 2019

Let's say you need to sum a large array of floating-point numbers, maybe because you're calculating the arithmetic mean or variance. This post shows how to do so accurately and quickly.

Here's what the accuracy-naïve, performance-naïve implementation looks like (Godbolt):

Improving Accuracy

When summing floating-point numbers, we need to consider the following:

Overflow Adding two numbers that have the value FLT_MAX (maximum value of float, ~3.4e38) using a single-precision float accumulator will obviously overflow. To work around this, we can simply use a double-precision accumulator and be able to safely sum DBL_MAX ÷ FLT_MAX ≈ 1.8e308 ÷ 3.4e38 ≈ 5.2e269 floats. That's a lot—more than the max uint64_t.

(If you're summing a huge number of huge double-precision numbers, using long double is sometimes a possibility. On x86 that usually uses the legacy 80-bit x87 float-point unit, which is incapable of SIMD operations and has half the throughput of the SSE FPU on recent Intel and AMD CPUs.)

Floating-point error Adding two numbers of very different magnitude causes loss of significant digits. Using a double-precision accumulator helps, but we can also use compensated summation to limit the error to be essentially independent of the number of values we're adding. Kahan summation is the most famous algorithm; it's simple and effective. Neumaier's improved version ("Kahan-Babuška Algorithm") is correct in some cases where Kahan is not, but is significantly slower. Pairwise summation is also popular and requires the fewest operations, but has somewhat worse error and needs to be implemented with careful concern for data locality (cache-awareness).

Here's a (more) accurate implementation that uses Kahan summation (Godbolt):

Improving Speed

We want to sum the numbers quickly. Here's what we'll do:

Break the loop-carried dependency. Modern CPUs can start a new operation before the previous one completes. This is called instruction pipelining. On Intel's Skylake architecture, two additions can be started per cycle ("throughput") despite taking four cycles to complete ("latency"). The equivalent numbers for AMD's Ryzen are one and three.

For this to happen, the second operation must not depend on the first operation's result. In the performance-naïve code above, each loop iteration depends on the previous iteration's result (there's a "loop-carried dependency"). By using multiple accumulators in the main loop that we sum together later, we break this dependency chain. The benefits of this are visible in scalar and SIMD code (below), and even in a JavaScript implementation run in v8. Notice though that it has minimal effect on Neumaier's algorithm because it requires so many more operations that there is minimal room for pipelining of loop iterations.

Effect of varying numbers of accumulators when summing a 187,500-element array of floats (fits in L2 cache) 2000 times. Run on an Intel Xeon Cascade Lake CPU (3.1GHz base). Compiled with GCC 8.1, except "naïve", which was compiled with Clang 9.0 to avoid a bad optimization with 8 accumulators in GCC.

This also effectively unrolls the loop, which itself can improve performance.

Use SIMD. Kahan and Neumaier summation can be trivially parallelized to operate on four (AVX) or eight (AVX-512) doubles at a time.

Clang with -ffast-math (which allows reordering of floating-point operations) does both of these optimizations automatically, although it only uses four vectors of accumulators (not quite enough for max speed). GCC auto-vectorizes but doesn't unroll or use multiple accumulators. However, -ffast-math is unsafe to use with compensated summation because the compiler can and does remove operations that are algebrically void.

Below are the benchmark results. We get a final speed up of ~8x over the naïve version (along with better floating-point accuracy) and a ~32x speedup over scalar by optimizing the Kahan algorithm.

Comparison of the different summation methods (same setup as above).

Here are the final AVX(+BMI2) and AVX-512 implementations. (This code uses Intel intrinsics, which are documented here.)

The actual code used to benchmark (harder to read but easier to tinker with) is located here. It includes AVX and AVX-512 Neumaier implementations not shown above.

If you have any comments, feel free to email or tweet using the links in the footer, or open an issue in my GitHub pages repo. Corrections and improvements greatly appreciated.

http://blog.zachbjornson.com/2019/08/11/fast-float-summation.html

How long do GCP and AWS authoritative DNS servers take to update?

Aug 14, 2018

We recently had a need to create new DNS records that are visible on the authoritative server as soon as possible. This is a short and simple post showing measurements of how long it takes for new records to propagate after making the API call to add those records to Google Cloud Platform's or AWS's Route 53 DNS servers.

GCP has guidance for monitoring changes to a DNS zone. They don't say what it means for a change's "status" to be "done", but that happens after just a few seconds and definitely doesn't mean that the record has propagated to all authoritative servers. They do say the authoritative servers should pick up the change within 120 seconds. AWS states that GetChange returns a status of INSYNC generally within 60 seconds.

Figure 1. **GCP** Percent of DNS queries responding with the new answer, queried several times per second for each second after the API request to add the record. MNAME is the primary name server according to the SOA record; c2/c3/c4 are other name servers for the domain.

Figure 2. **AWS** As above. The four series are labeled by the TLD of the DNS server.

Try it yourself

http://blog.zachbjornson.com/2018/08/14/dns-propagation.html

AWS S3 vs Google Cloud vs Azure:<br>Cloud Storage Performance

Dec 29, 2015

I’m building a new cloud product that quickly processes large amounts of scientific data. Our largest customer dataset so far is about 3,000 tables, each 40 to 80 MB and totaling 150 GB, which we aim to process in 10 seconds or less. Each table can be processed independently, so we parallelize heavily to reach this goal — our deployment uses 1000 vCPUs or more as needed. The tricky part is rapidly reading that much data into memory. Right now about 80% of the computation time is spent reading data, and that leads to the focus of this three-part post: cloud storage performance. As part of solving this problem, I evaluated a few different approaches: object storage, database-backed storage and attached storage, each of which I’ll be detailing in separate posts.

Midway through writing this, Google happened to released their multi-cloud PerfKitBenchmarker. As far as I can find, this post is also the first set of published results from PerfKitBenchmarker.

Part 1: Object Storage

Obligatory notice: Benchmarks are application-specific and vary based on network load, neighboring VM activity, the month in which you were born and phase of the moon. When reading this, consider if these benchmarks are relevant to your application and understand that these results may not be reflective of what you will experience.

Object storage (Amazon AWS Simple Storage Service (S3), Google Cloud Storage (GCS), Microsoft Azure Storage) provides a simple PUT/GET/HEAD/LIST interface for storing any sort of data too large to put into a database. It is appealing because it is cheap, provides safety through redundancy and scales automatically to serve many concurrent requests. The downsides are the latency (in part from establishing a new HTTP connection for every file) and the throughput being limited by the network throughput and availability.

downloads

I looked at two key metrics: time to first byte (TTFB) to measure latency, and throughput (Figure 1), when downloading files from a VM hosted by the same vendor in the same region. In these plots, the percentile (or quantile) is on the x axis, and the metric is on the y (note the log scale), so you can see the typical response (median, 0.5) as well as the best and worst case behaviors.

Azure and AWS S3 gave essentially the same latency, whereas GCS averaged more than three times higher latency. However, Google provided approximately 4x the throughput of Azure and approximately 2x the throughput of S3. Thus, GCS should finish downloading sooner than Azure for files larger than ~1 MB and sooner than S3 for files larger than 5 MB. (Emphasis on finish downloading. If your application can handle streaming input and is processing the data slower than the data is being downloaded, then S3 and Azure will perform better for any file size.)

See that unnaturally flat right-tail on AWS S3 at 91 MB/s (732 Mbps)? I bet that’s the maximum throughput enforced by AWS.

network throughput caps are problematic

Network throughput is core to these benchmarks, and thus VM type must be considered when designing an architecture. Above I’m showing results from VM types that have sufficient network bandwidth to accurately measure storage throughput as opposed to VM throughput. I’ll go into network throughput in detail in a future post, but for now looking at Figure 2 you can see that for GCE and Azure a 1 vCPU machine provides sufficient throughput, whereas AWS EC2 requires a much larger machine — an 8 vCPU c4.2xlarge would have sufficed, but I used a 16 vCPU c4.4xlarge.

Few EC2 instance types have sufficient network throughput to make S3’s throughput the bottleneck if you were to download one file per vCPU (i.e. divide network bandwidth by vCPU count, which is a crude but useful view; Figure 2-right). Google Compute Engine on the other hand has much higher network throughput such that most machine types have more bandwidth than GCS, even despite GCS having higher throughput than S3. Azure’s low throughput is accommodated by most of their instance types as well.

Figure 2. Measured intra-datacenter network throughput for varying VM types, total (left) and normalized by CPU count (right). Showing mean and standard deviation. Throughput was measured for 60 seconds using iperf. All tests were repeated two to six times. Shared-core machines were considered to have one vCPU when normalizing by CPU count. GCE tests were run in us-central1-a, except n1-standard-32 which was run in us-central1-b. AWS tests were run in us-east-1a. Azure tests were run in East US. See also [1] and [2].

To be thorough, I did run the storage benchmarks on 16 vCPU instances from GCE (n1-standard-16) and Azure (D5_v2), and a 1 vCPU instance from AWS (m3.medium). Interestingly, the smaller VMs from Google and Microsoft had slightly higher storage throughput than the larger VMs (not shown), possibly because of how VMs are distributed on shared hardware. AWS’s m3.medium was tightly capped at 32.5 MB/s.

GCS multi-region buckets rock

The GCS results in Figure 1 are for a US “multi-region” bucket. S3 has no multi-region buckets; S3 buckets are actually equivalent to GCS regional buckets. In contrast, multi-region buckets allow access from that set of regions without data transfer charges or duplication costs. (AWS charges $0.02 per GB for accessing a bucket from another region.) This is ideal if you have data processing servers in multiple regions. The throughput from both (…all two!) US data centers is excellent (Figure 3). I can't find anything useful in the terms of service, but I suspect multi-region buckets are more fault-tolerant as well.

Figure 3. Download throughput from all seven zones accessing a US multi-region bucket. (Data from Figures 4-right and 5-right.)

The docs in the above link suggest that a regional bucket will provide lower latency and higher throughput than a multi-region bucket, so I also tested a us-central1 regional bucket from all four us-central1 compute engine zones (Figure 4), and a GCS us-east1 regional bucket from all three us-east1 zones (Figure 5), to see if we can get even faster throughput or improve GCS’s poor latency.

Figure 4. Benchmarks for accessing a US multi-region bucket (solid) or a us-central1 regional bucket (dashed) from the four us-central1 zones.

Figure 5. Benchmarks for accessing a US multi-region bucket (solid) or a us-east1 regional bucket (dashed) from the four us-east1 zones.

With regional buckets, latency somehow increased. Median throughput was either the same or slightly higher with regional buckets, although the tails were typically worse. Odd.

uploads

For my application I don’t depend on fast uploads, but I’m including some of the benchmark results here for completeness (Figure 6). Similar to downloads, Google had higher latency but faster throughput, Microsoft had the lowest latency and lowest throughput, and AWS was in the middle for both.

object storage handles concurrency well

Finally, Google’s gsutil includes a perfdiag command, and because it’s built on boto it can actually be used for both GCS and S3. This tool includes the ability to read or write multiple objects at once, which adds some new information to the story (Table 3). This shows that these platforms scale well to serve concurrent requests at the scale that we would be using them.

Concurrency GCS S3 Read (MB/s) Write (MB/S) Read (MB/s) Write (MB/S) 1 113 34 61 20 5 480 156 306 87 10 641 284 520 128 20 845 588 603 141

Table 3. Throughput from gsutil perfdiag. Number of replicates: 20. File size: 32 MB. Concurrency: how many objects were read or written at once.

conclusions

Amazon and Azure provide the lowest latency, while Google provides the highest throughput, for both uploads and downloads. This means that AWS and Azure excel for smaller files, while GCE excels for larger files, and this highlights the importance of benchmarking with data that are comparable in size to what your application uses.
The substantial limitations on AWS EC2 network throughput must be taken into consideration when designing high-speed data processing systems.
Google's unique multi-region buckets keep costs down when working with data from multiple datacenters in the same region (e.g. continent).
Object storage scales automatically to provide high aggregate throughput.
Finally, note that I’m only showing data from API access (which is the exact same boto code for AWS and Google), and I have unsurprisingly observed substantial differences in performance from different clients (the vendor-specific CLIs, node.js API package, cURL’ing URLs, etc.).

Stay tuned for parts two and three on data storage, and future posts on other cloud performance metrics.

http://googlecloudplatform.blogspot.com/2015/11/bringing-you-more-flexibility-and-better-Cloud-Networking-performance-GA-of-HTTPS-Load-Balancing-and-Akamai-joins-CDN-Interconnect.html
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-config.html. While EBS-optimized instances have separate NICs dedicated to EBS, from my own testing these values seem to be the same as the limits found on the non-EBS NIC. There are some inconsistencies between what I measured and this page; for example, an m3.xlarge should have a limit of 62.5 MB/s, but I observed 126 MB/s, which is their reported limit for an m3.2xlarge. I wouldn’t be surprised if this was a botched AWS config setting. Same goes for the t2.micro’s high network performance, which contradicts what others (such as http://www.azavea.com/blogs/labs/2015/01/selecting-a-nat-instance-size-on-ec2/) have reported. More on network throughput in a future blog post!

http://blog.zachbjornson.com/2015/12/29/cloud-storage-performance.html

https://blog.zachbjornson.com/feed.xml

Posts