Wickopedia — GeistHaus

How do bitwidths actually work in Verilog?

John Wickerson May 6, 2026

This post introduces a project that Gabriel Desfrene did in my group as an intern last summer. Gabriel will present his project in a paper coauthored with my PhD students Quentin Corradi and Michalis Pardalos and myself at CAV 2026 in July. Verilog (aka SystemVerilog) is the most widely used language for specifying, designing, and… Continue reading How do bitwidths actually work in Verilog? →

Show full content

This post introduces a project that Gabriel Desfrene did in my group as an intern last summer. Gabriel will present his project in a paper coauthored with my PhD students Quentin Corradi and Michalis Pardalos and myself at CAV 2026 in July.

Verilog (aka SystemVerilog) is the most widely used language for specifying, designing, and verifying all kinds of computer hardware. But some aspects of the language are widely misunderstood – even some as fundamental as “how many bits are used for each operation”.

Consider this example, which is due to Quentin:

logic [3:0] foo;
assign foo = (3'b110 + 3'b110 + 3'b110 + 3'b110) >> 2;

The first line is declaring foo to be a 4-bit variable. On the second line, 3'b110 is Verilog for “the 3-bit binary number 110”, and >> is the right-shift operation. That is, we are performing the addition of four copies of 110, then shifting to the right by 2, and finally storing the result in foo.

Verilog is a hardware description language: it describes pieces of hardware. This means that the addition written in Verilog has to map to a physical “adder” in the eventual hardware. And in order for that adder to come into being, it must have a width.

But what is the correct width?

Option 1. We could use a 3-bit adder – the same width as the operands. Then 110+110+110+110 gives 000 (it would have been 11000, but the highest two bits are lost to overflow). And after the right-shift we get 000. Thus, the final value of foo, after adding a leading zero to make a 4-bit value, is 0000.
Option 2. Perhaps we should use a wider adder? This would have the advantage of not losing bits to overflow, but the disadvantage of requiring more circuitry (which leads to hardware that ultimately consumes more electricity). If we don’t want to lose any bits to overflow here, we need at least a 5-bit adder. Then 110+110+110+110 gives 11000, and after the right-shift we get 00110. Thus, the final value of foo, after chopping off the leading zero to make a 4-bit value, is 0110.

Which of these options does the Verilog language standard say is the correct one? Well, it actually opts for Option 3: the width of the adder is, in fact, determined by the width of the variable being assigned to (i.e. foo) which is 4 bits in our example. So the addition 110+110+110+110 gives 1000 (losing just one bit to overflow this time), and after the right-shift we get 0010 as the final value for foo.

Examples like this are a well-known source of confusion among hardware engineers. For instance, a fairly well established Verilog tutorial from ChipVerify.com quite reasonably warns its readers that “[calculating the bitwidths of expressions] is subtle”.

It then goes on to get the explanation completely wrong. It incorrectly states that if a and b are 8-bit variables, then “assign product = a * b” will “lose the upper bits of the actual 16-bit product”, even if product has 16 bits. (In fact, this multiplication happens at 16 bits, that being the width of the left-hand side of the assignment, so no bits are lost to overflow.)

I asked Reddit’s resident Verilog experts about ChipVerify’s example. They unanimously (at the time of writing) described the tutorial as incorrect.

But they all gave different reasons as to why it is incorrect.

And none of those reasons were right.

Still, as Bill Vaughan famously wrote, “To err is human; to really foul things up requires a computer.” So, let’s see whether today’s AI models understand Verilog’s bitwidth rules any better than those humans do. Here’s what ChatGPT said when I asked it what value gets assigned to foo in my example:

Each operand is a 3-bit value (3'b110 = 6). In SystemVerilog, when you add fixed-width vectors, the result does not automatically grow—it stays the width of the operands unless you explicitly widen it. So each intermediate addition is truncated back to 3 bits [and] the full sum becomes 3'b000. […] Finally, assigning to logic [3:0] foo zero-extends the result: foo = 4'b0000 (decimal 0).

In other words, it’s going for Option 1. This is wrong.

Meanwhile, Meta’s Muse Spark model goes for Option 2, which is also wrong:

In SystemVerilog, when you add unsigned values, the result width is max(widths) + 1 for each addition. […] Result: foo = 4'b0110 = 6 decimal. Key gotcha: If you thought the 3-bit operands would truncate the sum at each step, they don’t. Verilog/SystemVerilog expands the width during expression evaluation. The truncation only happens on the final assignment to the 4-bit foo, but 6 fits fine.

Claude Opus 4.1 also goes for Option 2 to begin with, but then it changes its mind mid-response, and ends up going for the correct Option 3:

The value assigned to foo is 6 (binary 0110). […] Wait—that’s not quite right either. Let me reconsider the expression context. […] Hmm, but there’s more nuance—the 3'b110 literals are explicitly 3-bit, and the intermediate width depends on how operands combine. […] Since the final assignment target is 4-bit foo, the entire expression is context-determined to 4 bits. […] Final answer: foo = 4'b0010 (decimal 2).

Google’s Gemini is the only model I tried that confidently gets it right first time:

In SystemVerilog, the bit-width of an expression is determined by the “context-determined” rules. Since foo is a 4-bit logic type, the entire right-hand side is evaluated using 4-bit precision. […] The value assigned to foo is 4'b0010 (decimal 2).

Anyway, the long and short of it is this: Verilog’s rules for bitwidths are confusing.

To some extent, the rules are unavoidably complicated, but a large part of the problem is that they are specified in informal prose that is spread over several parts of the (very large) Verilog standard.

So, aiming to bring as much clarity as possible to the situation, we decided to formalise Verilog’s rules for bitwidths in the Rocq theorem prover. The murky details are in our paper and in Gabriel’s Rocq development, but in this blog post I’ll just explain how things work roughly, via a few examples.

The calculation happens in two stages. These stages are called “determine” and “propagate”. In the “determine” phase, bitwidth information flows up the expression tree, calculating the “self-determined width” of every sub-expression. Then, in the “propagate” phase, bitwidth information flows back down the expression tree, fixing the “final width” of every sub-expression.

Here’s how it works on our example. We start with the expression we want to analyse, drawn as a syntax tree:

The addition is labelled as just a “BinOp” (binary operation) because all arithmetic operations (additions, subtractions, multiplications, and so on) are handled in the same way. The only node that has a bitwidth already is foo, and that bitwidth is obtained from foo‘s declaration.

The “determine” phase sends bitwidth information up this tree, as shown by the red arrows in the following picture:

Working from the bottom, we see that the constants all have a self-determined width of 3, and this fixes the self-determined width of the BinOp at 3 too (3 being the max of 3, 3, 3, and 3). Thus the ShiftOp also has a self-determined width of 3 (it ignores the width of its right-hand operand). The width of the top-level assignment operation is determined only by its left-hand operand, foo, which has a width of 4.

Then, the “propagate” phase sends bitwidth information back down the tree. In the following picture, we write “ShiftOp : 3 → 4” to mean that the self-determined width of the ShiftOp is 3, but its final width is 4.

Thus, we can see that the final width of the adder (the BinOp in the middle of the picture above) is 4 bits.

For the example from ChipVerify’s tutorial, the “determine” phase sets the self-determined width of the multiplication to be 8 bits, but the “propagate” phase sets its final width to 16 bits because this is the width of the left-hand side of the assignment.

I’ll do one last example, as there’s one more subtlety that is worth mentioning. Consider another little Verilog design:

logic [3:0] baz;
assign baz = (3'b110 + {2'b10, 1'b0}) >> 2;

It’s similar to our first example, but it includes a concatenation operation, in which the 2 bits of 2'b11 are placed alongside the single bit of 1'b0 to form the 3-bit value 3'b100. Drawn as a syntax tree, we have:

After running the “determine” phase, we get:

Note that the self-determined width of the concatenation operation is the sum (+) of the widths of its operands. The rest is similar to the previous example.

The difference comes when we run the “propagate” phase:

Note that the width of 4 has propagated from baz, setting the final width of the shift, the BinOp, the constant 3'b110, and the concatenation. But it has not propagated below the concatenation. The concatenation operation can thus be seen as a “boundary” to propagation. In the terminology of the Verilog standard, each operands of a concatenation operation has its bitwidth “solely determined by the expression itself” and not “by the fact that it is part of another expression”.

So there we go; that’s more-or-less how bitwidth calculations really work in Verilog. Our paper has more of the details, and more still are available in Gabriel’s Rocq development. We imagine that our work can be valuable in a couple of ways. Firstly, we expect that it can strengthen the guarantees provided by tools like Lustig and Vera, which can reason formally about the correctness about Verilog designs but currently have to assume that all bitwidths have been written out explicitly. Secondly, we hope that our work can help the maintainers of the Verilog standard come up with a way to explain bitwidth calculations with more clarity and precision, so as to minimise confusion in the future.

http://johnwickerson.wordpress.com/?p=2776

Extensions

Finding bugs in FPGA place-and-route tools

John Wickerson Jan 6, 2026

This blog post is about Ollie Cosgrove’s paper (with Ally Donaldson and me) that will be presented at the 34th International Symposium on FPGAs in February. Over the last few years, I’ve enjoyed being a part of several projects that have been giving hardware design tools a hard time; that is: developing techniques for uncovering… Continue reading Finding bugs in FPGA place-and-route tools →

Show full content

This blog post is about Ollie Cosgrove’s paper (with Ally Donaldson and me) that will be presented at the 34th International Symposium on FPGAs in February.

Over the last few years, I’ve enjoyed being a part of several projects that have been giving hardware design tools a hard time; that is: developing techniques for uncovering bugs in the tools that hardware designers use to design hardware. We’ve done testing campaigns on logic synthesis tools, on high-level synthesis tools, on equivalence checkers, and on Verilog parsers – all of which have found correctness bugs in widely used commercial tools.

In this work, we’ve looked at another important tool of hardware design: FPGA place-and-route (P&R) engines. A P&R engine takes as input a netlist, which is a low-level description of a hardware design that consists of a collection of components (memories, registers, adders, logic gates, and so on) and how they are to be wired together. The job of the P&R engine is to work out how best to assign those components to the FPGA’s available computational resources (that’s the “placement” part) and how those resources should be connected (that’s the “routing” part). A P&R engine is allowed to modify the netlist if that enables a better outcome – as long as doing so doesn’t change the design’s behaviour, of course! For instance, it might decide to replicate a component in order to avoid the need for really long wires.

Ollie built a tool, called FUZNET, that generates thousands of random (but valid) netlists, fed them into AMD’s Vivado P&R engine, and checked in each case whether the pre- and post-P&R netlists are equivalent. He was able to find two bugs in Vivado 2024.2, one of which has been received by AMD but not yet confirmed, and one of which has been confirmed and fixed for the 2025.2 version. I’ll explain this second bug shortly, but first I’ll make a few remarks about FUZNET itself.

Why fuzzing P&R engines is challenging

In some ways, this was just another fuzzing project: generating lots of random inputs, feeding them to a system-under-test, and investigating any unexpected outputs. But P&R turned out to be a surprisingly interesting domain for fuzzing, for five reasons:

Fuzzing is effective at uncovering bugs not because its randomly generated test cases are particularly high-quality, but because it churns through such a huge number of them, just like monkeys wielding typewriters. In other words, if fuzzing is going to work, we need to run test cases at a fast rate. This is a challenge for P&R engines, which take a long time to run! Hence, Ollie had to try to make each test case as high-quality as possible.
Usually, fuzzing involves generating very large test cases, since these are likely to be challenging for the system-under-test to handle, and hence most likely to trigger bugs. But bigger is not necessarily better in a P&R context, because a P&R engine is targeting a device with a fixed set of resources. If a given test case has no hope of fitting onto the target device, the P&R engine will simply give up. Hence, Ollie had to try to generate test cases that are large enough to challenge the P&R engine, but small enough to fit on the target device.
When the P&R engine does produce a netlist, we need a way to check whether it is equivalent to the netlist it was given. Even this seemingly simple task is tricky, because no single approach offers a silver bullet: simulating the pre- and post-P&R netlists with a bunch of random input vectors is not exhaustive, but trying to prove the pre- and post-P&R netlists logically equivalent using automated solving can be intolerably slow! Hence, Ollie had to come up with a clever combination of equivalence-checking techniques that would work well in the common case.
If fuzzing ever does stumble upon a bug-triggering test case, we need to perform “test-case reduction”: a trial-and-error process for finding a minimal version of the test case that still triggers the bug. Most reduction algorithms involve churning through a huge number of trials, which is fine when the system-under-test produces each output quickly, but not great when testing a slow-running P&R engine! Hence, Ollie had to devise a new reduction algorithm that converges quickly.
The other problem with test-case reduction arises from how the P&R engine works. Conventional software compilers produce output that is “as good as possible”, but most tools in hardware design are designed to produce output that is “as good as it needs to be”. That is: the hardware designer gives the P&R engine a set of constraints (about things like circuit area and clock frequency), and once those have been met, the P&R engine won’t bother trying to optimise any further. This complicates test-case reduction, because if removing a chunk from a failing test-case leads to it now passing, we don’t know if that’s because that particular chunk was the crux of the bug, or simply because the P&R engine didn’t need to invoke so many optimisations on the smaller test-case! Hence, Ollie had to make sure that when reducing a test-case, he also upped the constraints by the right amount so that the same optimisations would still be performed.

Explaining the bug

Here’s a picture of the confirmed-and-fixed bug in AMD Vivado’s P&R engine.

The netlist at the top is the one given to the P&R engine; the P&R engine changes it to the one at the bottom, but the two are not quite equivalent. The explanation is a little involved, but I’ll try. The brown “CFGLUT5” component is a “5-input dynamically reconfigurable look-up table”, which means that it is essentially a table with 32 rows, and it uses five inputs, I4–I0, to select which of those rows should be output (on O5 and O6). In the pre-P&R netlist, the input vector is (0,0,0,0,1), which selects row 1. However, Vivado has noticed that the table happens to be initialised so that rows 0 and 1 contain the same data, so it chooses to transform the circuit to use a simpler input vector, (0,0,0,0,0), which selects row 0.

However, this reasoning fails to account for the “dynamically reconfigurable” nature of the CFGLUT5. Normal look-up tables have their entries fixed at initialisation, but those in a CFGLUT5 can be changed at runtime. This is done by setting the CE signal and then feeding the new entries serially through the CDI port. In the pre-P&R netlist, the CE signal is constantly set, which means that the CFGLUT5 is continually being reconfigured, using whatever data comes in on input _02_ (top left). The behaviour of the CFGLUT5 component is thus completely unknowable to the P&R engine, and as such, the P&R engine cannot safely make any changes to the CFGLUT5’s inputs. Thus, in particular, changing the input vector from (0,0,0,0,1) to (0,0,0,0,0) is not sound.

In conclusion: once again, fuzzing turns out to be effective at finding subtle bugs in mature and widely used tools. There are plenty of ways FUZNET could be extended to find more bugs, e.g. by incorporating more advanced components into the netlists it generates. It would also be nice to expand the class of P&R bugs that FUZNET is capable of finding: currently, it only checks the changes that the P&R engine has made to the netlist, but there could also be bugs in the actual placement and routing that the engine comes up with – devising techniques that are capable of uncovering these is a job for future work.

http://johnwickerson.wordpress.com/?p=2757

Extensions

Are conference reviewers harsher when they have a submission of their own?

John Wickerson Nov 19, 2025

My area of academia runs mainly on conferences, as opposed to journals. This means that a few times each year, hundreds of researchers simultaneously submit papers about their latest and greatest project, and a “program committee” decides which of those get to appear in the proceedings of the upcoming conference. There is usually considerable overlap… Continue reading Are conference reviewers harsher when they have a submission of their own? →

Show full content

My area of academia runs mainly on conferences, as opposed to journals. This means that a few times each year, hundreds of researchers simultaneously submit papers about their latest and greatest project, and a “program committee” decides which of those get to appear in the proceedings of the upcoming conference.

There is usually considerable overlap between the members of that program committee and the researchers who submit papers. This is partly because if conferences didn’t allow the members of their program committee also to submit their own papers, few would be willing to serve on that committee. (But of course, nobody reviews their own submission!)

It seems to me that if a member of the program committee has a submission of their own under review, and wishes to act purely selfishly, then they should be as negative as possible about all the papers they are reviewing. Thus, the argument goes, they should increase the chances of their own paper ending up near the top of the rankings and ultimately being accepted.

To what extent does this actually happen?

I accessed some data for a recent computer science conference. I partitioned the 73 members of the program committee into the 47 who had no submission of their own and the 26 who had at least one. (Only a handful of reviewers had more than one submission, so I lumped them all together.)

I also looked at the average scores given by all those reviewers. Each reviewer reviewed between one and five papers, and scored each one on a 4-point scale (strong reject, weak reject, weak accept, strong accept). So I took each reviewer’s “mean review score” by a weighted average (strong reject = -1.5, weak reject = -0.5, weak accept = 0.5, strong accept = 1.5).

I found that the mean review scores given by the two groups are as follows:

No paper under review-0.117At least one paper under review-0.478

Both numbers are negative, which means both groups of reviewers lean towards rejecting rather than accepting – what grumps! But those that have a paper of their own under review seem noticeably more negative than those that don’t.

Here’s a graphical version, which shows a bit more information:

The graph shows the distribution of mean review scores for the two groups (blue line = no submission, green line = at least one submission). The dots are the actual data points after dividing the review scores into buckets of width 0.2; the lines are moving averages (window = 2) which smoothes out the noise a bit. We see that both distributions are vaguely Gaussian, peaking a bit below a score of 0, but that the green line is noticeably skewed to the left (more negative scores).

I also calculated the Mann–Whitney U-test on this data, which promises to quantify whether this apparent skew is actually statistically significant. The number crunching says yes, there is a statistically significant effect here: there’s only about a 0.4% chance that this skew is simply caused by random noise.

So that’s quite interesting. Reviewers do seem to be harsher when they have a paper of their own under submission.

Of course, it’s perfectly possible that having a paper under submission doesn’t actually cause a reviewer to be harsher, but rather, that there is a third factor causing both of these things to be simply correlated.

One candidate factor is the seniority of the reviewer. Senior professors might give lower review scores because they’re more familiar with the research area and hence find it easier to spot missing citations, etc.; they may also be more likely to have papers under submission from their large research group. Or it might go the other way: senior professors might give higher scores because they gloss over the little flaws that junior folk pick up on, and they may be less likely to have papers under submission because they’re not focusing so urgently on building up their publication profile.

But no, it would seem not. Here’s a graph that shows the relationship between how many publications a reviewer has co-authored (according to their DBLP profile) and their review scores.

Visually, there seems to be no relationship, and sure enough, Spearman’s coefficient of rank correlation confirms that there is no statistically significant correlation here.

The same applies if we use “number of years since first publication” as another proxy measure for seniority, again using data from DBLP:

Again, there’s no visual correlation here, and Spearman agrees. So seniority doesn’t appear to be a confounding factor here.

Another factor to consider is whether the reviewer is affiliated with a University or not. Perhaps reviewers who are affiliated with a company or a government organisation are not so focused on growing their publication count, and hence may feel less “competitive” when reviewing.

My data says that yes, this does indeed seem to be an important factor. The graph below plots the scores given by reviewers currently affiliated with a University (blue line) and reviewers not currently affiliated with a University (green line).

It looks like the University-affiliated reviewers are harsher (mean=-0.329) than their non-University-affiliated counterparts (mean=0.027). The Mann–Whitney U-test reports a 4% chance that this is just down to random noise, so there’s just about a statistically significant effect here.

What’s certainly the case is that the non-University-affiliated reviewers are not submitting many papers: of the 17 reviewers, only one of them had a submission of their own, whereas the 56 University-affiliated reviewers had an average of 0.64 submissions.

One way to proceed here would simply be to discount all the non-University-affiliated reviewers. Here’s the graph obtained from doing that:

We see that the University-affiliated reviewers with a submission of their own (green line) are still noticeably harsher than those without (blue line). And the Mann–Whitney U-test confirms that there’s still a statistically significant effect going on here: there’s only a 2% chance that this is just down to random noise.

So, what to do about this?

One could imagine attaching a little “badge” to reviewers’ names in the discussion forums that indicates “I have at least one paper under submission myself”. That might discourage them from being overly negative about a paper. Then again, reviewers might overcompensate and become too positive in their reviews, for fear of being accused of acting in a self-interested manner!

A more subtle approach could be to make this information visible to program chairs while papers are being allocated to reviewers, so that they can try to ensure that each submission has a balanced number of reviewers from both groups.

http://johnwickerson.wordpress.com/?p=2725

Extensions

Fuzzing Quantum Compilers

John Wickerson Dec 18, 2024

Ilan Iwumbwe and Benny Liu did undergraduate research placements with me this summer, and I’m very pleased that they will be presenting their work at the Programming Languages for Quantum Computing (PLanQC) workshop at POPL next month. Ilan and Benny built a tool called QuteFuzz that randomly generates descriptions of quantum circuits. The idea is… Continue reading Fuzzing Quantum Compilers →

Show full content

Ilan Iwumbwe and Benny Liu did undergraduate research placements with me this summer, and I’m very pleased that they will be presenting their work at the Programming Languages for Quantum Computing (PLanQC) workshop at POPL next month.

Ilan and Benny built a tool called QuteFuzz that randomly generates descriptions of quantum circuits. The idea is to use these randomly generated circuits to fuzz-test quantum compilers, such as Quantinuum’s Pytket, IBM’s Qiskit, and Google’s Cirq. We want to check that they don’t crash and that any optimisations they do on the quantum circuits don’t change their behaviour.

QuteFuzz is not the first random generator of quantum circuits – there exist a handful of tools that do this already, such as QDiff and MorphQ – but it does aim to test a couple of language features that prior tools don’t reach. Specifically, QuteFuzz focuses on producing random quantum circuits that contain subcircuits and control flow.

Subcircuits are analogous to subroutines in ordinary programming: they allow a piece of computation to be defined once and then instantiated many times.
Control flow refers to quantum circuits where the qubits can pass through different sequences of quantum gates depending on the outcome of a measurement.

Below is an example of a quantum circuit generated by QuteFuzz. I won’t explain it line-by-line, but I’ve labelled the part that defines a subcircuit and the parts that define if-statements for control flow.

Here is a graphical representation of that same circuit. We see the subcircuit (circuit-163), the one-armed if-statement (If-0 to End-0), and the two-armed if-statement (If-1 to Else-1 to End-1).

The natural way to deploy these randomly generated quantum circuits is to use differential testing: checking that the circuits behave the same way if compiled by two different compilers, or at two different optimisation levels. The problem with this approach is that two circuits producing different final qubit values doesn’t necessarily indicate a compiler bug – it could just be the inherent non-determinism of quantum computing. One workaround for this is to simulate the circuits hundreds of times, until we are reasonably confident (via the Kolmogorov–Smirnov test) that the two circuits are using the same probability distribution of final qubit values.

For example, the graph below shows the final qubit values observed from 1024 simulation runs of that randomly generated circuit when compiled using Qiskit with no optimizations. The output is “16” on 790 of the runs, and “48” on all the rest.

The next graph shows the result of simulating that same circuit having compiled it with the same compiler at optimisation level 3. It is immediately obvious that the qubit values are following a completely different distribution. This turned out to be the result of an invalid optimisation being performed by the Qiskit compiler.

All in all, QuteFuzz found seventeen bugs in the quantum compilers that Benny and Ilan tested – some miscompilations, some crashes, and some in the simulators that were used to produce graphs like the ones above. All but two of those bugs have since been confirmed by the developers, and three have already had fixes deployed.

For more information about QuteFuzz, check out Benny and Ilan’s Github repo and their paper.

http://johnwickerson.wordpress.com/?p=2653

Extensions

The Hoare Cube

John Wickerson Dec 4, 2024

I wrote earlier this year about my attempt to understand the repercussions of toggling and when giving a semantics to Hoare triples. In response to that post, Yann Herklotz helpfully pointed me to Patrick Cousot’s POPL 2024 paper [10], which does all this and a lot more besides. In particular, Cousot’s paper includes the following… Continue reading The Hoare Cube →

Show full content

I wrote earlier this year about my attempt to understand the repercussions of toggling $\subseteq$ and $\supseteq$ when giving a semantics to Hoare triples. In response to that post, Yann Herklotz helpfully pointed me to Patrick Cousot’s POPL 2024 paper [10], which does all this and a lot more besides. In particular, Cousot’s paper includes the following diagram…

…which shows how to obtain dozens of program logics just by toggling a handful of switches in the semantics.

However, that diagram is completely terrifying, and (just between you and me) I can’t get my head around much of the rest of Cousot’s paper. So, this blog post is my attempt to end up with a similar diagram to Cousot’s one, but while trying to keep things as simple as possible.

Let us assume that we have a set of states, and that a command is simply a binary relation on states. We shall write $(\sigma, \sigma') \in c$ to mean that if command $c$ is started in state $\sigma$ it can produce final state $\sigma'$ . Our commands need not always terminate: if there is no $\sigma'$ such that $(\sigma, \sigma') \in c$ , then that means $c$ does not terminate when started in state $\sigma$ . Also, our commands can be non-deterministic: if $(\sigma, \sigma') \in c$ and $(\sigma, \sigma'') \in c$ with $\sigma' \neq \sigma''$ , then that means $c$ behaves non-deterministically when started in state $\sigma$ .

The original “partial correctness” interpretation [1] of the Hoare triple $\{P\}\,c\,\{Q\}$ is that whenever $c$ is started in a state satisfying the assertion $P$ (an assertion is just a set of states), any final state it reaches must satisfy the assertion $Q$ . That is…

$\forall \sigma, \sigma'\ldotp \sigma \in P \wedge (\sigma,\sigma') \in c \longrightarrow \sigma' \in Q$ .

Let us define a helper function: the “strongest postcondition”. We shall write $\text{sp}(P,c)$ for the set of all states that command $c$ can reach if it starts executing in a $P$ state…

$\text{sp}(P, c) = \{\sigma'\ldotp \exists\sigma\ldotp (\sigma,\sigma') \in c \wedge \sigma \in P\}$ .

We can use $\text{sp}$ to give a semantics to Hoare triples in a variety of different ways. For instance, the partial correctness interpretation given above can be rewritten as…

$\text{sp}(P,c) \subseteq Q$ .

Flipping the $\subseteq$ into a $\supseteq$ to get Incorrectness logic…

$\text{sp}(P,c) \supseteq Q$

…is what led O’Hearn to remark that “program correctness and incorrectness are two sides of the same coin”. But there are other components that can be “flipped” too. I think there are three “dimensions” in total:

Reach: we can flip from a may semantics (“the states $c$ can reach if…”) to a must semantics (“the states $c$ can only reach if…”)
Direction: we can use $c$ as-is (a forward semantics), or we can invert it to $c^{-1}$ (a backward semantics).
Approximation: we can flip between $\subseteq$ (where the command’s behaviour is “over approximated”) and $\supseteq$ (where the command’s behaviour is “under approximated”).

(NB: Cousot seems to identify four dimensions in his diagram above, but I haven’t understood what his fourth one is.)

The eight options, then, are as follows:

Note that partial correctness can also be rewritten into the more familiar $P \subseteq \text{wlp}(c, Q)$ if we define $\text{wlp}$ as the “weakest liberal precondition” in the usual way…

$\text{wlp}(c,Q) = \{\sigma\ldotp \forall\sigma'\ldotp (\sigma,\sigma') \in c \longrightarrow \sigma' \in Q\}$

…but I quite like how “systematic” the table above is: it distinguishes may from must by negating the pre and postconditions; it distinguishes forward from backward by using $c$ or $c^{-1}$ ; and it distinguishes over from under approximate by using $\subseteq$ or $\supseteq$ .

Note also that I’ve labelled $\text{sp}(\neg P,c) \subseteq \neg Q$ as the “original” formulation of necessary preconditions, as it captures the two negatives in Cousot et al.’s definition quite nicely: “under which precondition, if violated, will the program always be incorrect?” [7, my emphasis].

We can assemble all eight options in the table into a cube — a “Hoare cube”, if you will.

In the cube, I’ve drawn solid lines between vertices that represent equal semantics. For instance, may+forward+over is equal to must+backward+over because…

$\text{sp}(P,c) \subseteq Q \quad\Longleftrightarrow\quad \text{sp}(\neg Q,c^{-1}) \subseteq \neg P$ .

Sometimes these equalities can be phrased as Galois connections. For instance…

$\text{sp}(P,c) \subseteq Q \quad\Longleftrightarrow\quad P \subseteq \text{wlp}(c, Q)$

…signifies a Galois connection between $\text{sp}(-,c)$ and $\text{wlp}(c, -)$ .

I’ve also drawn dashed lines between vertices that represent equivalent semantics. By “equivalent” here, I mean that there exists a bijection $f$ between triples such that the triple $\{P\}\,c\,\{Q\}$ holds according to one semantics if and only if the triple $f(\{P\}\,c\,\{Q\})$ holds according to the other semantics. What this means is that one strategy for proving $\{P\}\,c\,\{Q\}$ under one semantics is to prove $f(\{P\}\,c\,\{Q\})$ under the other semantics.

For instance, may+backward+over (necessary preconditions, $\text{NC}$ ) is equivalent to may+forward+over (partial correctness, $\text{PC}$ ) because…

$\text{NC} \models \{P\}\,c\,\{Q\} \quad\Longleftrightarrow\quad \text{PC} \models \{\neg P\}\,c\,\{\neg Q\}$

…as Ascari et al. have observed [4, Proposition 5.4]. Here we use the bijection $f_1$ that maps $\{P\}\,c\,\{Q\}$ to $\{\neg P\}\,c\,\{\neg Q\}$ .

For another instance, may+forward+under (incorrectness logic, $\text{IL}$ ) is equivalent to may+backward+under (sufficient incorrectness logic, $\text{SIL}$ ) because, using another bijection $f_2$ that maps $\{P\}\,c\,\{Q\}$ to $\{Q\}\,c^{-1}\,\{P\}$ , we have…

$\text{IL} \models \{P\}\,c\,\{Q\} \quad\Longleftrightarrow\quad \text{SIL} \models \{Q\}\,c^{-1}\,\{P\}$ .

For this reason, I’ve deemed all of the over nodes equivalent (and coloured them green) and all of the under nodes equivalent (and coloured them blue).

But is it a bit of a stretch to consider all those same-coloured nodes “equivalent”? I mean, if the point of these bijections is to suggest how a proof in one logic can be transformed into a proof in another logic, then are these transformations reasonable? The bijection $f_2$ , for instance, requires the ability to invert an arbitrary command (turning $c$ into $c^{-1}$ ). That’s straightforward enough for simple programs, thanks to properties like $(c; d)^{-1} = (d^{-1} ; c^{-1})$ and $(c + d)^{-1} = (c^{-1} + d^{-1})$ and $(c^*)^{-1} = (c^{-1})^*$ , but it could become infeasible in more complicated settings. Likewise, the bijection $f_1$ requires the ability to negate an arbitrary assertion (turning $P$ into $\neg P$ ), which is fine in ordinary predicate logic but doesn’t work very nicely in a substructural logic like separation logic.

Still, it does give a pretty handy way to generate proof rules for one logic from the proof rules for another logic. For instance, Ascari et al. [4] present this proof rule from incorrectness logic…

…and if we apply our bijection $f_2$ to all the triples in this rule, then we obtain “for free” the corresponding proof rule from their sufficient incorrectness logic…

…without having to prove anything about its soundness.

To conclude: perhaps program correctness and incorrectness are not simply two sides of the same coin, but two opposite faces of the same cube!

Related work

Möller et al. [2, Section 5] define and briefly study “backward under-approximate Hoare triples”. Zilberstein et al. [3] refer to those triples also as “Lisbon triples”. Ascari et al. [4] define sufficient incorrectness logic as a proof system for those “backward under-approximate Hoare triples”, and I’m grateful to Ike Mulder who referred me to Ascari et al.’s paper in a comment on my earlier blog post.

Zilberstein et al. [3] defines and studies “Outcome logic”, but I haven’t included it in my “cube” because I don’t think it quite fits with the other logics – its postconditions are not simply “sets of states”, but have a more complicated monoidal structure. I also haven’t included “Exact separation logic” [11], which unifies over-approximate and under-approximate reasoning in the context of separation logic.

Zhang and Benjamin Kaminski [8, Section 6.3] also studied the eight options in my table above, similarly obtained by toggling three binary switches (albeit with slightly different names), and I’m grateful to Benjamin for pointing me to his work in a comment on my earlier blog post. I’ve reproduced their table here:

The two blue rows in their table are equal, and they correspond to the two “partial correctness” vertices in my cube. The two orange lines are also equal, and they correspond to the two “necessary preconditions” vertices in my cube. I also haven’t got names for the ??? and ¿¿¿ logics, but I note that ??? can be straightforwardly obtained from sufficient incorrectness logic via the $f_1$ bijection I defined above, and ¿¿¿ can be obtained from incorrectness logic in the same way, so I think this makes it unlikely that those logics will offer anything particularly new.

For the record: one potential point of confusion in Zhang and Kaminski’s table is that what they call “wp” is actually what O’Hearn [6, Section 6.3] calls the “weakest possible precondition”: the sets of states from which $c$ can reach a final state satisfying the postcondition (but due to non-determinism, might not). Thus, I think “total correctness” should probably be read as “possible correctness”, following Hoare’s terminology [9, Section 5.3].

Bibliography

[1] C.A.R. Hoare: An axiomatic basis for computer programming. CACM, 1969.
[2] Bernhard Möller, Peter O’Hearn, and Tony Hoare: On Algebra of Program Correctness and Incorrectness. RAMiCS, 2021.
[3] Noam Zilberstein, Derek Dreyer, and Alexandra Silva: Outcome Logic: A Unifying Foundation for Correctness and Incorrectness Reasoning. OOPSLA, 2023.
[4] Flavio Ascari, Roberto Bruni, Roberta Gori, and Francesco Logozzo: Sufficient Incorrectness Logic: SIL and Separation SIL. ArXiv, 2023.
[5] Edsko de Vries and Vasileios Koutavas: Reverse Hoare Logic. SEFM, 2011.
[6] Peter O’Hearn: Incorrectness Logic. POPL, 2020.
[7] Patrick Cousot, Radhia Cousot, Manuel Fähndrich, Francesco Logozzo: Automatic Inference of Necessary Preconditions. VMCAI, 2013.
[8] Linpeng Zhang and Benjamin Lucien Kaminski: Quantitative Strongest Post. OOPSLA, 2022.
[9] C.A.R. Hoare: Some properties of predicate transformers. JACM, 1978.
[10] Patrick Cousot: Calculational Design of [In]Correctness Transformational Program Logics by Abstract Interpretation. POPL, 2024.
[11] Petar Maksimovic, Caroline Cronjäger, Andreas Lööw, Julian Sutherland, and Philippa Gardner: Exact Separation Logic: Towards Bridging the Gap Between Verification and Bug-Finding. ECOOP, 2023.

http://johnwickerson.wordpress.com/?p=2571

Extensions

Fun with income tax

John Wickerson Nov 26, 2024

The UK government imposes a tax on people’s income, and, as is quite conventional, the rate at which a person pays this tax increases as their income increases. However, I was surprised to notice recently that this rate does not increase monotonically with income. Here is how the UK government explains the calculation of income… Continue reading Fun with income tax →

Show full content

The UK government imposes a tax on people’s income, and, as is quite conventional, the rate at which a person pays this tax increases as their income increases.

However, I was surprised to notice recently that this rate does not increase monotonically with income.

Here is how the UK government explains the calculation of income tax:

So far, so simple. You pay no tax on the first £12,570 of your income, then you pay 20% of the next £37,700, then 40% of the next £74,870, and finally 45% of whatever’s left.

However, there is an additional complexity not captured by the table above. Once your income hits £100,000, your Personal Allowance (the portion of your income on which you pay no tax) starts to go down. It reduces by £1 for every £2 your income is over £100,000, eventually stopping at £0 once your income hits £125,140.

It’s not completely clear how the bands in the table above are supposed to change if the Personal Allowance reduces. Should the Basic-rate band stretch as the Personal Allowance band contracts? Should the Basic-rate band slide down and the Higher-rate band stretch? Or should both the Basic-rate and Higher-rate bands slide down and the Additional-rate band kick in earlier?

I believe the second option is the correct one. This is because a separate government web page indicates how the bands are supposed to work when the Personal Allowance is £0, and it shows the Basic rate only applying for the first £37,700 of your income, and the Additional rate still kicking in over £125,140. (Ok, technically “over £125,141”, but I think that’s just a typo.) That means it must be the Higher-rate band stretching.

So, as your income increases within the £100,000–£125,140 region, not only does more of it lie in the 40%-tax band, but also some of your income moves out of the no-tax band into the 40%-tax band. When these two effects are combined, it turns out that the tax paid on income in the £100,000–£125,140 region is actually 60%, which is considerably higher than the 45% tax paid on income above that region!

The graph below shows the effect quite clearly. The x-axis shows a person’s income and the y-axis shows the total tax they pay out of that income. See how the blue line turns upwards at x=£100,000, and then downwards again at x=£125,140.

I think there are a couple of ways that income tax rates could be explained more clearly. One way would be to abolish the Personal Allowance reduction, and instead introduce an extra 60% band into the table. That would give:

BandTaxable incomeTax ratePersonal AllowanceUp to £12,5700%Basic rate£12,571 to £50,27020%Higher rate£50,270 to £100,00040%Even higher rate£100,001 to £125,14060%Additional rateover £125,14045%

A perennial problem with presenting tax rates like this is that it’s easy to forget that each tax rate only applies to the part of the income that lies in the relevant band. For instance, it’s easy to think that if your income is £100,000, you pay 40% of it in tax (when the reality is that you’d only pay about 27%).

One alternative presentation, which tries to address this somewhat, is as follows:

BandIf your income is…Then you keep…Personal AllowanceUp to £12,570All of itBasic rate£12,571 to £50,270The first £12,570, plus 80% of the restHigher rate£50,270 to £100,000The first £31,420, plus 60% of the restEven higher rate£100,001 to £125,140The first £54,280, plus 40% of the restAdditional rateover £125,140The first £30,660, plus 55% of the rest

Then again, we could be a bit more radical, and get rid of the bands altogether. Here’s one possible formula for calculating your tax rate (as a percentage) using the exponential function.

$\text{tax rate} = 50 \times ( 1 - \exp(-\text{income} \div 120000))$

That is: take the negative of your income (in £), divide it by 120000, calculate the exponential function of the result, subtract that from 1, and finally multiply by 50. Yes it’s a little convoluted, but it does avoid the problems of the band-based system that I explained above. And with the numbers chosen like I have done, the tax paid using the exponential formula (green line in the graph below) is remarkably similar to the tax paid in the current system (blue line):

Taxable incomeTotal tax paid (band system)Total tax paid (my formula)£1,0000£4£10,0000£400£20,000£1,486£1,535£30,000£3,486£3,318£40,000£5,486£5,669£80,000£19,432£19,463£120,000£39,432£37,927£160,000£58,203£58,912£200,000£76,203£81,112

One feature of my formula-based proposal is that there’s no Personal Allowance – income tax is paid on all income, no matter how small. But perhaps it’s good for social cohesion if people can see that the same formula applies to everybody’s income, no matter how high or low it is. Another feature of my proposal is that very high earners do end up paying a bit more – asymptoting to 50% of their income eventually. But perhaps once their income is that high, they won’t notice contributing a bit more of it!

Addendum. A couple of people have mentioned National Insurance. This is an additional tax on income: 0% of any income between £0 and £12,570, plus 8% of any income between £12,570 and £50,270, plus 2% of any income above £50,270. It could be abolished and rolled into income tax. In my formula-based approach, it is a simple matter of changing one of the numbers:

$\text{tax rate} = 50 \times ( 1 - \exp(-\text{income} \div 100000))$

The graph above would then become:

http://johnwickerson.wordpress.com/?p=2536

Extensions

Who checks the equivalence checkers?

John Wickerson Sep 10, 2024

This post provides a short introduction to a paper that will be presented at DVCon in Munich next month by my PhD student Michalis Pardalos about work we have done with collaborators Alastair Donaldson, Emi Morini and Laura Pozzi. Hardware engineers take great care to ensure that their designs are correct. Unlike bugs in software,… Continue reading Who checks the equivalence checkers? →

Show full content

This post provides a short introduction to a paper that will be presented at DVCon in Munich next month by my PhD student Michalis Pardalos about work we have done with collaborators Alastair Donaldson, Emi Morini and Laura Pozzi.

Hardware engineers take great care to ensure that their designs are correct. Unlike bugs in software, which can often be easily fixed with a quick patch, mistakes in hardware can be exceedingly difficult and costly to address. As such, it is vitally important that hardware synthesis tools, which take human-readable hardware designs and transform them into lower-level representations that are readily implemented in silicon, are correct.

In fact, hardware engineers routinely “double check” the working of these synthesis tools by feeding the input they were given and the output they produced to an “equivalence checker“. An equivalence checker is a tool that takes two hardware designs, perhaps one written in a high-level language like SystemC and one written in a lower-level language like Verilog, and checks whether the two designs really are equivalent; i.e., that they behave in the same way in all possible situations.

The results from these equivalence checkers are deeply trusted, even to the point of authorising a design to be “signed-off” for tape-out and fabrication. Indeed, the Synopsys marketing blog claims that their HECTOR equivalence checker:

delivers 100% confidence that the RTL design implementation conforms to the C/C++ reference algorithm, thereby significantly speeding up signoff for datapath components as compared to simulation-based techniques.

while Siemens claim that their SLEC equivalence checker:

enables designers to have the ultimate confidence to move to high-level synthesis […] dramatically reducing or eliminating the need for design teams to perform simulation/verification of RTL.

My PhD student Michalis Pardalos has been examining how trustworthy these equivalence checkers really are.

He has built a fuzzing tool called Equifuzz, which generates random programs in the SystemC language. (SystemC extends C++ with variable-width datatypes and a few other things, and it is widely used to specify hardware.) There are a few nifty features that make Equifuzz well suited for the task of testing equivalence checkers:

First, it generates SystemC programs in an iterative fashion, maintaining well-formedness at each step. For instance, here is one possible path it could take:

Start with random seed value of -1:
int dut() {
return -1;
}
Add a cast to a 10-bit fixed-point number with an 8-bit integer part (and hence a 2-bit fractional part):
sc_dt::sc_fixed<10,8> dut() {
sc_dt::sc_fixed<10,8> x0 = sc_dt::sc_fixed<10,8>(-1);
return x0;
}
Add a cast to a 8-bit unsigned integer:
sc_dt::sc_uint<8> dut() {
  sc_dt::sc_fixed<10,8> x0 = sc_dt::sc_fixed<10,8>(-1);
  sc_dt::sc_uint<8> x1 = sc_dt::sc_uint<8>(x0);
  return x1;
}

Generating programs in an iterative fashion like this means that test-case reduction becomes rapid – a simple matter of deleting as many of the transformations as possible while preserving the bug. Rapidly converging test-case reduction is always nice, but is especially vital when testing equivalence checkers because they can be quite slow to produce a result, often taking a minute or so.

Second, it generates input-free SystemC programs that calculate a single numerical result. Prior fuzzers like Csmith and YARPGen do the same, but what’s useful about this when testing equivalence checkers is that it is often possible to rephrase “false negative” bugs as “false positives”. To see this, consider the last SystemC program listed above. Compiling it with GCC and then executing the result tells us that it should produce the output 255. Then suppose we give that program to an equivalence checker, together with a Verilog design that produces the correct output:

module top (output [7:0] out); assign out = 8'd255; endmodule

If the equivalence checker wrongly reports “not equivalent”, then we have found a false negative bug. This is valuable, but a false negative is a “failsafe” kind of bug: it will simply prevent the hardware engineer from proceeding. However, the equivalence checker usually produces a counterexample in its output log, and by inspecting this we may find that it thinks our SystemC program actually produces a different output, say 252. It stands to reason, then, that if we ask the equivalence checker if our SystemC program is equivalent to this Verilog design:

module top (output [7:0] out); assign out = 8'd252; endmodule

then we should observe a false positive bug. A false positive is a more critical kind of bug because it could lead a hardware engineer to think that it is ok to proceed when it is not.

Third, Equifuzz doesn’t avoid undefined behaviour. Most compiler fuzzers like Csmith or YARPGen go to considerable effort to avoid generating programs that contain undefined behaviour, because such programs are useless for testing compilers – a compiler can do anything it likes with them! The situation is different for equivalence checkers, though. A program that contains undefined behaviour must never be deemed “equivalent” – even to itself!

In all the testing Michalis has done to date, Equifuzz has found seven false-positive bugs in three major commercial ECs (where the checker claims equivalence incorrectly). Equifuzz has additionally found five “incompleteness” bugs (where the checker failed to prove equivalence between equivalent designs).

One of the false-positive bugs is the example given above, where the correct output is 255 but one of the equivalence checkers we tested reported the output as being 252. The crux in this case is the implementation of the cast from sc_fixed to sc_uint. The SystemC specification says that this cast should discard the fractional part of the fixed point number before interpreting the integer part as an sc_uint, but this particular equivalence checker was instead interpreting the entire bitvector.

As a second example of a bug we found, the following SystemC program (which happens not to use any SystemC-specific features) should produce 0x3FF0000000000000:

double dut() { int x0 = 1; return double(bool(x0)); }

but another equivalence checker we tested deemed it equivalent to Verilog that produces 0x0000000000000001 instead. This equivalence checker seems to be conflating the representation of “1” as a 32-bit integer and the representation of “1” as a 32-bit double, which has a very different bitvector.

For more details about Equifuzz and the bugs it found, check out Michalis’s paper, or say hello to him at DVCon in Munich next month!

tingey-injury-law-firm-yCdPU73kGSc-unsplash

http://johnwickerson.wordpress.com/?p=2510

Extensions

Automated feature testing of Verilog parsers using fuzzing

John Wickerson Sep 9, 2024

I’m delighted that Quentin Corradi, a PhD student I jointly supervise with George Constantinides, will be presenting his work to improve the reliability of hardware design tools next week at the FUZZING’24 workshop, a satellite event of the ISSTA conference. The Verilog language is widely used in hardware design, and is accepted by a multitude… Continue reading Automated feature testing of Verilog parsers using fuzzing →

Show full content

I’m delighted that Quentin Corradi, a PhD student I jointly supervise with George Constantinides, will be presenting his work to improve the reliability of hardware design tools next week at the FUZZING’24 workshop, a satellite event of the ISSTA conference.

The Verilog language is widely used in hardware design, and is accepted by a multitude of tools, including synthesisers, simulators, and equivalence checkers. Some are open-source, such as ANTLR4, hdlConvertor, Icarus Verilog, Moore, Surelog, Slang, sv-parser, sv2v, Tree-sitter-verilog, Verible, Verilator, Verismith, and Yosys, while others are proprietary, such as Cadence Conformal, Lattice Diamond, Siemens FormalPro, Cadence Genus, Cadence Jasper, Siemens Leonardo Spectrum, Siemens Oasys-RTL, Intel Quartus, Siemens Questasim, AMD Vivado, and Cadence Xcelium Parallel Logic Simulation.

However, most of these tools do not accept the entire Verilog language. And not only do they not accept the entire Verilog language: they often do not make it clear exactly which features they do or don’t accept.

This situation can make it difficult to transfer a Verilog design from one tool to another. And, according to Dave Rich, the technical chair of the IEEE SystemVerilog Working Group committee, it even leads some people to stick with Verilog-1995 syntax, abandoning all the language features that have been standardised since then, simply because feature support is so inconsistent between different tools and different vendors.

Some effort has already been made to document the feature support of Verilog-consuming tools. The SV-Tests project, for instance, does just this for a range of open-source Verilog-consuming tools. However, support for each language feature is assessed only by running a small set of handwritten tests, so it is hard to be confident that a feature really is fully supported, even when it is reported as such. For instance, SV-Tests reports that a Verilog parser called “Surelog” has full support for a Verilog language feature called “escaped identifiers”. However, our pilot experiments quickly identified that the following (valid) Verilog design:

module \TEST ; endmodule

is (incorrectly) rejected by Surelog. (The root cause is actually quite amusing. Surelog uses the following regular expression for escaped identifiers:

ESCAPED_IDENTIFIER: '\\' [WS\r\t\n]*? WS; WS: [ ]+;

The intention here is that an escaped identifier begins with a backslash (\), then has a sequence of characters that are not spaces (WS) or line breaks (r or n) or tabs (t), and finally ends with a space (WS). The problem is that when WS appears inside the square brackets, it is interpreted not as an abbreviation for [ ]+, but as the individual characters W and S! As such, Surelog does not support escaped identifiers that contain an uppercase W or an uppercase S, such as \TEST in the example above.)

As such, we are working to do more thorough testing of Verilog feature support with the help of grammar-based fuzzing. Starting from a comprehensive grammar of the Verilog language that we have captured from the official standard, we generate lots of syntactically valid but otherwise random Verilog designs, and see which ones are accepted by which Verilog-consuming tools.

In parallel with this “acceptance testing” (i.e. testing which valid inputs are correctly accepted), we are also doing some “rejection testing” (i.e. testing which invalid inputs are correctly rejected). Rejection testing may seem less critical; after all, surely it is a good thing if a tool does something useful with input that lies outside of the standardised Verilog language? But this is not always the case: it could lead to a hardware engineer wrongly thinking their design is standard-compliant and running into problems if they later feed it to a different Verilog-consuming tool. Here’s one rather trivial example that our pilot experiments identified:

module top #(parameter x = 0,); endmodule

This is technically not valid Verilog, thanks to the trailing comma in the parameter list. But it is accepted by Yosys.

I have talked above about “pilot experiments” because the paper Quentin is presenting at the FUZZING workshop next week is not the end of the story. Rather, it is a registered report – commonplace in the traditional sciences but fairly radical in computing research. His paper details the fuzz-testing experiments that he plans to run, which means that the workshop committee has accepted it based on the strengths of his hypotheses and his proposed method, rather than on how many bugs he happened to stumble upon. Stay tuned to find out how many bugs his testing campaign on the tools listed at the top of this post actually reveal!

http://johnwickerson.wordpress.com/?p=2497

Extensions

In praise of Freddy, My Love

John Wickerson Jul 4, 2024

I was singing Freddy, My Love as a lullaby for my toddler last night, and enjoying the nifty rhyming triplets (catches–patches–matches, and so on), when I suddenly realised that the song has almost twice as many rhymes as I had previously thought! The last three lines of each verse not only rhyme at their ends,… Continue reading In praise of Freddy, My Love →

Show full content

I was singing Freddy, My Love as a lullaby for my toddler last night, and enjoying the nifty rhyming triplets (catches–patches–matches, and so on), when I suddenly realised that the song has almost twice as many rhymes as I had previously thought! The last three lines of each verse not only rhyme at their ends, but they rhyme half way through as well! It made me give an audible gasp (not ideal while lullabying) so I thought it worth sharing.

Here are the lyrics. The rhymes are in bold; the ones in blue are new to me. They mean that each verse ends with an AB–AB–AB pattern.

Freddy, my love,
I miss you more than words can say.
Freddy, my love,
Please keep in touch while you’re away.
Hearing from you can make the day so much better,
Getting a souvenir or maybe a letter,
I really flipped over the grey cashmere sweater!

Freddy, you know,
Your absence makes me feel so blue.
That’s okay though,
Your presents make me think of you.
My ma will have a heart attack when she catches
Those pedal-pushers with the black leather patches.
Oh, how I wish I had a jacket that matches!

Don’t keep your letters from me,
I thrill to every line.
Your spelling’s kinda crummy,
But honey, so is mine.
I treasure every gifty.
The ring is really nifty.
You say it cost you fifty,
So you’re thrifty –
I don’t mind!

Freddy, you’ll see,
You’ll hold me in your arms someday,
And I will be
Wearing your lacy lingerie.
Thinking about it, my heart’s pounding already.
Knowing when you come home, we’re bound to go steady
And throw your service pay around like confetti!

Of course, tinkerers gotta tinker, so here’s my attempt to force even more rhymes into each verse (my changes are in red). To be clear, I think Jim Jacobs and Warren Casey’s original lyrics work better overall, but it’s rather fun to see an ABC–ABC–ABC rhyming pattern.

Freddy, my love,
I miss you more than words can say.
Freddy, my love,
Please keep in touch while you’re away.
Hearing from you can make the day so much better,
Getting a souvenir or maybe a letter,
I really swooned over the grey cashmere sweater!

Freddy, you know,
Your absence makes me feel so blue.
That’s okay though,
Your presents make me think of you.
My ma will go into a rage when she catches
Those penny loafers with the aged leather patches.
Oh, how I hope you’ll send the beige skirt that matches!

Don’t keep your letters from me,
I thrill to every line.
Your spelling’s kinda crummy,
But honey, so is mine.
I treasure every gifty.
The ring is really nifty.
You say it cost you fifty,
So you’re thrifty –
I don’t mind!

Freddy, you’ll see,
You’ll hold me in your arms someday,
And I will be
Wearing your lacy lingerie.
Just thinking through it, my heart’s pounding already.
Knowing when you come home, we’re bound to go steady
And throw your duty pay around like confetti!

http://johnwickerson.wordpress.com/?p=2481

Extensions

Mix-testing: revealing a new class of compiler bugs

John Wickerson Jun 28, 2024

I’m delighted that Luke Geeson’s work on “mix testing” (a collaboration with James Brotherston, Wilco Dijkstra, Alastair Donaldson, Lee Smith, Tyler Sorensen, and myself) will appear at OOPSLA 2024 in October. There has been quite a lot of work on “litmus testing” of compilers in the last couple of decades. This mainly takes the form of… Continue reading Mix-testing: revealing a new class of compiler bugs →

Show full content

I’m delighted that Luke Geeson’s work on “mix testing” (a collaboration with James Brotherston, Wilco Dijkstra, Alastair Donaldson, Lee Smith, Tyler Sorensen, and myself) will appear at OOPSLA 2024 in October.

There has been quite a lot of work on “litmus testing” of compilers in the last couple of decades. This mainly takes the form of (1) generating lots of little concurrent programs, (2) feeding them into a compiler, and then (3) seeing if the compiled code does the right thing when executed on the target machine. Luke’s idea is: what if we don’t give the whole litmus test to a single compiler, but instead we break it up into little fragments, and have each fragment compiled by a different compiler. Thus, we have the potential to find bugs that arise out of unexpected interactions between compilers.

Here’s a simple example of how a “mixing bug” might arise when targeting the x86 architecture.

Consider the following “store buffering” litmus test, written in C:

Two threads write and read to two shared locations, x and y. The C standard requires that after executing these two threads, the values in t and u must not both be zero (in other words: at least one of the threads must see the other thread’s write).

However, if this program is compiled naïvely to x86, that requirement will not be met, because when x86 processors encounter a write instruction followed by a read of a different location, they do not promise to keep them in order. Therefore, the compiler must insert a barrier between them, to prevent the potential reordering.

There are two plausible ways to accomplish this. One is for the compiler to insert a barrier after each write. That would yield something like this:

The other option is to insert a barrier before each read. That would yield something like this:

The first scheme is more popular because it tends to result in fewer barriers, given that writes are generally less common than reads. It is the scheme used by the major compilers, GCC and Clang. But either scheme is fine … as long as it is used consistently. Problems arise if the two schemes are mixed. For instance, if the first instructions in both threads are compiled using the barriers-before-reads scheme, and the second instructions in both threads are compiled using the barriers-after-writes scheme, then both compilers will wrongly assume that the other compiler is responsible for keeping the instructions in order, and no barrier will be inserted at all!

That example serves to illustrate what mixing bugs look like, but it’s not really a bug because it is already well understood that the barriers-before-reads and barriers-after-writes compilation schemes must not be mixed.

So, here is a real mixing bug. It’s a little more complicated, but the basic idea is the same: two different compilation schemes interacting in such a way that fails to enforce some necessary ordering.

Consider the same “store buffering” litmus test again, but this time compiled for Arm. Here’s how it is compiled by Clang for the Armv7 architecture:

I’ll skip the gory details of the assembly code. What’s important is how the compiler ensures that the stores and loads don’t get reordered. In this case, the compiler has inserted a barrier (DMB) at the end of all four instruction sequences. In particular, the barriers marked with the red rectangles are the ones that are important here.

Here’s how that same program is compiled by Clang for the Armv8 architecture:

This mapping is rather more efficient because it has avoided all those (expensive) barriers, and instead exploited some features that only became available in Armv8. Specifically, store-release instructions (STL) cannot be reordered with subsequent load-acquire instructions (LDA). In particular, this means that the stores and loads marked with the blue puzzle pieces cannot be reordered. I use a puzzle-piece metaphor here to indicate that these instructions only enforce ordering when they appear together. Neither instruction acts as a barrier on its own.

Now suppose we compile the first instruction in each thread for Armv8, but the second instruction for Armv7. This is perfectly legal, by the way – the resulting code should run fine on an Armv8 machine because Armv8 is fully backward-compatible with Armv7. We get something like this:

We see that neither the Armv7 nor the Armv8 mechanism for inserting a barrier is working. The Armv7 scheme relies on a barrier being inserted after the store, but the Armv8 scheme does not provide this barrier! Instead, the Armv8 scheme uses a store-release, but this is ineffective because the Armv7 scheme does not produce the corresponding load-acquire! In short, nothing is preventing the stores and loads being reordered, and we have a compiler bug.

Luke has built a tool that explores lots of possible compiler combinations for each litmus test, and he has used it to find several bugs like the one above. (That bug remains open at the time of writing, by the way.) Luke is now working with his colleagues at Arm to make their ABI more explicit about how and when various compilation schemes can be safely mixed. Hopefully, this effort will provide firmer foundations for mixed compilation in the future.

This work was supported by the Engineering and Physical Sciences Research Council (grant number EP/V519625/1). The views expressed in this blog post are not endorsed by Arm or any other company mentioned.

http://johnwickerson.wordpress.com/?p=2447

Extensions

Cycling from King’s Cross to Imperial College London

John Wickerson May 11, 2024

I’ve used Santander Cycles to get from King’s Cross to Imperial College hundreds of times over the last couple of years, and I’d like to think that I’m pretty close to finding the fastest possible route now. The basic route is to aim for New Cavendish Street and then cross Hyde Park, but I’ve found… Continue reading Cycling from King’s Cross to Imperial College London →

Show full content

I’ve used Santander Cycles to get from King’s Cross to Imperial College hundreds of times over the last couple of years, and I’d like to think that I’m pretty close to finding the fastest possible route now.

The basic route is to aim for New Cavendish Street and then cross Hyde Park, but I’ve found several tweaks that can shave off a few seconds, e.g. a quick zig-zag along Leigh Street to avoid the traffic lights at the Judd Street/Tavistock Place intersection.

The whole thing tends to take me about 27 minutes, which is handy because Santander Cycles only starts charging after 30 minutes.

I’ve optimised primarily for speed, but quite a bit of the route is on dedicated cycle lanes, and it’s not completely devoid of interest along the way – one passes by Gordon Square, Tavistock Square Gardens, University College London, the University of Westminster, Baker Street, Hyde Park, and the Royal Albert Hall.

Here’s a picture of the route from Google Maps.

Picture of a cycle route from King's Cross to Imperial College London.

The return from Imperial College to King’s Cross is basically identical, except you have to stay on the A5204 all the way from Edgware Road to Tottenham Court Road because of the one-way system.

http://johnwickerson.wordpress.com/?p=2428

Extensions

Verified high-level synthesis – now with hyperblocks!

John Wickerson Mar 1, 2024

Yann Herklotz has added hyperblock scheduling to his verified high-level synthesis (HLS) tool called Vericert, and I’m very pleased that our paper about this work has been accepted to appear at PLDI 2024 this June. This paper was our “difficult second album”, in the sense that we’d already published a paper about the first version… Continue reading Verified high-level synthesis – now with hyperblocks! →

Show full content

Yann Herklotz has added hyperblock scheduling to his verified high-level synthesis (HLS) tool called Vericert, and I’m very pleased that our paper about this work has been accepted to appear at PLDI 2024 this June.

This paper was our “difficult second album”, in the sense that we’d already published a paper about the first version of Vericert in OOPSLA 2021, so we no longer had such a crisp “novelty” story. That is, we were no longer presenting “the first mechanically verified HLS tool”, but rather “a verified HLS tool that generates more efficient hardware designs”. I say “difficult” not because we had particular trouble getting the paper accepted – PLDI 2024 was actually the first venue we tried – but because the amount of work Yann did to add hyperblock scheduling to Vericert was about twice as much as the work he did to produce the entire first version of Vericert!

So what’s hyperblock scheduling all about, and why has it been added to an HLS tool?

As a quick primer: HLS is the process of compiling software (written in C, say) not into assembly language for a general-purpose processor, but into a design (written in Verilog, say) for a fixed-function piece of hardware that behaves like the input software. It’s popular these days because offloading computation to custom hardware accelerators is a great way to improve performance and energy efficiency, but writing those accelerators manually is laborious and error-prone. Better to specify the computation in software and then generate the hardware accelerator automatically via HLS!

Unfortunately, today’s crop of HLS tools are known to be a bit buggy, which makes them ill suited for use in a safety- or security-critical setting. This motivated the development of Vericert v1.0 – the first HLS tool mechanically proven (in Coq) never to produce a Verilog design that behaves differently from the C program it is given.

Although Vericert v1.0 had a solid correctness story, its performance was quite underwhelming compared to that of the existing (unverified) HLS tools. Its main shortcoming was that it failed to exploit the inherently parallel nature of hardware. Instead, it produced hardware that performed just one operation per clock cycle. Essentially, it took the input C program, transformed that C program into a control-flow graph (CFG), and turned that CFG into a state machine, with just one instruction per state.

More concretely, if Vericert v1.0 were given a piece of straight-line C code like this:

c = a * b;
e = c + d;
f = a * d;
g = e * b;

it would produce Verilog that looks (roughly) like this:

initial state = 0;

always @ (posedge clk)
  case (state)
    0: begin c = a * b; state = 1; end
    1: begin e = c + d; state = 2; end
    2: begin f = a * d; state = 3; end
    3: begin g = e * b; end
  endcase

and which would execute over four clock cycles (states) like this:

Now, we’d like to pack more computation into each clock cycle. We could do this really naively by packing everything into a single clock cycle. That would produce Verilog that looks (roughly) like this:

initial state = 0;

always @ (posedge clk)
  case (state)
    0: begin c = a * b; 
             e = c + d;
             f = a * d;
             g = e * b; end
  endcase

and which would execute, in a single clock cycle, like this:

Although our computation now requires fewer clock cycles, our hardware is still not good because its critical path (highlighted in pink above) is too long. That is, the clock frequency would have to be really low in order to allow time for a multiplication, then an addition, and then a second multiplication all to complete within a single cycle.

What HLS tools do in practice, then, is (1) fix a desirable target clock frequency, and then (2) try to pack as many operations into each clock cycle as possible, ensuring that no cycle has a critical path that is too long for the target frequency. This is a matter of scheduling, and it might end up producing Verilog that looks (roughly) like this:

initial state = 0;

always @ (posedge clk)
  case (state)
    0: begin c = a * b; 
             e = c + d;
             f = a * d; 
             state = 1; end
    1: begin g = e * b; end
  endcase

and which executes, in two clock cycles, like this:

This design probably offers the best compromise between minimising the number of clock cycles and maximising the clock frequency. And better still, if it can be determined that the value of c is not actually used except to hold the intermediate calculation a*b, then we can use an efficient fused multiply-add operator, like so:

We thus arrive at the first contribution of Vericert v2.0: list scheduling. This is an optimisation that takes each basic block of the C code, and decides how best to chunk it up into clock cycles, taking account of the estimated delay of each operation.

However, existing HLS tools go further than this.

The thing is, list scheduling only deals with straight-line code, like in our example. We’d prefer also to optimise code that has more interesting control flow, such as if-statements, like this:

f = a * d;
if (d > 0)
  c = a * b;
else
  g = b * c;
e = c + d;

List scheduling would approach that code as four separate basic blocks, each requiring a separate state in the Verilog state machine. However, if we perform a process called if-conversion then we end up with what is known as a hyperblock. A hyperblock is a basic block where each instruction is guarded by a predicate, and an instruction executes only if its predicate evaluates to true. If-conversion on the code above might yield the following hyperblock:

true    => f = a * d;
d > 0   => c = a * b;
d <= 0  => g = b * c;
true    => e = c + d;

We’re back to straight-line code, and can hence can apply our scheduling approach once again to see how many operations we can safely pack into each clock cycle. But our scheduler must now be a hyperblock scheduler, which means that it must deal with predicates correctly. For instance, it must take account of the fact that the predicates d > 0 and d <= 0 are mutually exclusive, so the assignments to c and g can’t both occur, and can hence be scheduled for the same clock cycle (and even for the same multiplication unit).

So that’s the essence of Vericert v2.0 – Vericert v1.0 with a hyperblock scheduler, and everything still mechanically proven correct in Coq. Vericert v2.0 is still not truly competitive with state-of-the-art unverified HLS tools, as there are still many more optimisations that it doesn’t perform, but our performance evaluation indicates that when a state-of-the-art open-source HLS tool called Bambu is restricted to use only the optimisations that Vericert v2.0 knows about, the two tools perform quite similarly. And what’s nice about Yann’s proof is that it is completely independent of how the scheduler is implemented – he confirms that each schedule produced is correct, using a Coq-verified checker, but carefully ignores how the scheduler works. This means that the scheduler can be tweaked over time without needing to revisit the correctness proof.

http://johnwickerson.wordpress.com/?p=2409

Extensions

What is the other Incorrectness logic?

John Wickerson Feb 15, 2024

The Hoare triple has a very simple meaning, namely: . That is: if the precondition is satisfied in some state , and can transform into , then must satisfy the postcondition . (We’ll allow non-deterministic commands, so need not be uniquely determined by and .) Or, more concisely: if holds before executing , then will… Continue reading What is the other Incorrectness logic? →

Show full content

The Hoare triple $\{P\}\,c\,\{Q\}$ has a very simple meaning, namely:

$\forall\sigma, \sigma'\ldotp \sigma \in P \wedge (\sigma,\sigma') \in c \longrightarrow \sigma'\in Q$ .

That is: if the precondition $P$ is satisfied in some state $\sigma$ , and $c$ can transform $\sigma$ into $\sigma'$ , then $\sigma'$ must satisfy the postcondition $Q$ . (We’ll allow non-deterministic commands, so $\sigma'$ need not be uniquely determined by $c$ and $\sigma$ .) Or, more concisely: if $P$ holds before executing $c$ , then $Q$ will hold afterwards.

The statement above can be reformulated in a couple of different ways, all equivalent. One way is to introduce Dijkstra’s concept of a weakest precondition, like so:

$pre(c,Q) = \{\sigma\mid \forall \sigma'\ldotp (\sigma,\sigma') \in c \longrightarrow \sigma'\in Q\}$

and then observe that $\{P\}\,c\,\{Q\}$ is the same as $P \subseteq pre(c,Q)$ . Another way is to introduce the concept of a strongest postcondition, also due to Dijkstra, like so:

$post(P,c) = \{\sigma'\mid \exists \sigma\ldotp \sigma\in P \wedge (\sigma,\sigma') \in c\}$

and then observe that $\{P\}\,c\,\{Q\}$ is the same as $post(P,c) \subseteq Q$ .

◆

In 2019, O’Hearn introduced Incorrectness logic. This logic involves triples $[P]\,c\,[Q]$ that have a subtly different meaning to Hoare’s, namely:

$\forall\sigma'\ldotp \sigma' \in Q \longrightarrow \exists \sigma\ldotp \sigma \in P \wedge (\sigma,\sigma') \in c$ .

That is: if the postcondition $Q$ is satisfied by some state $\sigma'$ , then there must exist some state $\sigma$ that can be transformed into $\sigma'$ by command $c$ and that satisfies the precondition $P$ . Or, more concisely: every state where $Q$ holds is reachable by executing $c$ from a state where $P$ holds.

There’s a neat duality between Incorrectness logic and Hoare logic:

$\{P\}\,c\,\{Q\}$ is equivalent to $post(P,c) \subseteq Q$

$[P]\,c\,[Q]$ is equivalent to $post(P,c) \supseteq Q$

What’s curious is that this $\subseteq$ / $\supseteq$ trick can only be used with the strongest-postcondition formulation. If you try it on the weakest-precondition formulation, you don’t get Incorrectness logic. In other words, $P \subseteq pre(c,Q)$ and $post(P,c) \subseteq Q$ are logically equivalent, but $P \supseteq pre(c,Q)$ is not the same as $post(P,c) \supseteq Q$ . If $post(P,c) \supseteq Q$ corresponds to Incorrectness logic, then what does $P \supseteq pre(c,Q)$ correspond to?

◆

So let’s see what happens if we introduce a third type of triple, $\langle P\rangle\,c\,\langle Q\rangle$ , which shall mean:

$\forall \sigma\ldotp (\forall \sigma' (\sigma,\sigma') \in c \longrightarrow \sigma' \in Q) \longrightarrow \sigma \in P$ .

That is: if $\sigma$ is a state that is always transformed by command $c$ into one that satisfies the postcondition $Q$ , then $\sigma$ must satisfy the precondition $P$ . Or, more concisely: if $P$ doesn’t hold before executing $c$ , then $Q$ will not hold afterwards.

Essentially, we are filling in the gap marked “???” in the picture below (where the thick arrows indicate equivalences):

What can we say about these new $\langle P \rangle c \langle Q \rangle$ triples? What properties do they enjoy?

In some ways, all three forms of triple are alike. For instance, they all admit the same Sequencing rule:

$\displaystyle\frac{ \{ P \} c_1 \{ Q \} \quad \{ Q \} c_2 \{ R \} } { \{ P \} c_1 ; c_2 \{ R \} }$

$\displaystyle\frac{ [ P ] c_1 [ Q ] \quad [ Q ] c_2 [ R ] } { [ P ] c_1 ; c_2 [ R ] }$

$\displaystyle\frac{ \langle P \rangle c_1 \langle Q \rangle \quad \langle Q \rangle c_2 \langle R \rangle } { \langle P \rangle c_1 ; c_2 \langle R \rangle }$

and they all admit the Iteration rule:

$\displaystyle\frac{ \{ P \} c \{ P \} } { \{ P \} c^* \{ P \} }$ $\displaystyle\frac{ [ P ] c [ P ] } { [ P ] c^* [ P ] }$ $\displaystyle\frac{ \langle P \rangle c \langle P \rangle } { \langle P \rangle c^* \langle P \rangle }$

In some ways, the new triples behave more like Incorrectness triples than like Hoare triples. This is the case for the Skip rules:

$\displaystyle\frac{ P \Rightarrow Q } { \{ P \} \text{skip} \{ Q \} }$ $\displaystyle\frac{ Q \Rightarrow P } { [ P ] \text{skip} [ Q ] }$ $\displaystyle\frac{ Q \Rightarrow P } { \langle P \rangle \text{skip} \langle Q \rangle }$

and for the Consequence rules:

$\displaystyle\frac{ P' \Rightarrow P \quad \{ P \} c \{ Q \} \quad Q \Rightarrow Q' } { \{ P' \} c \{ Q' \} }$

$\displaystyle\frac{ P \Rightarrow P' \quad [ P ] c [ Q ] \quad Q' \Rightarrow Q } { [ P' ] c [ Q' ] }$

$\displaystyle\frac{ P \Rightarrow P' \quad \langle P \rangle c \langle Q \rangle \quad Q' \Rightarrow Q } { \langle P' \rangle c \langle Q' \rangle }$

In yet other ways, all three forms of triple are different. For instance, they all say different things about the command $\varnothing$ that always diverges (fails to terminate):

$\{ P \} \varnothing \{ Q \}$ always holds, because Hoare logic gives no guarantees about non-terminating executions.
$[ P ] \varnothing [ Q ]$ holds iff $Q = False$ , because Incorrectness logic demands termination.
$\langle P \rangle \varnothing \langle Q \rangle$ holds iff $P = True$ , because $\varnothing$ will always fail to reach $Q$ .

Another way to distinguish the three forms is to ask what they say when the postcondition is unsatisfiable. (In what follows, let $diverges(c)$ be the set of states from which execution of $c$ diverges.)

$\{ P \} c \{ False \}$ holds iff $P \Rightarrow diverges(c)$ ; i.e. $P$ is a sufficient condition for $c$ diverging.
$[ P ] c [ False ]$ always holds.
$\langle P \rangle c \langle False \rangle$ holds iff $diverges(c) \Rightarrow P$ ; i.e. $P$ is a necessary condition for $c$ diverging.

Here are some rules about $c_1 \cup c_2$ , which denotes a non-deterministic choice between commands $c_1$ and $c_2$ :

$\displaystyle\frac{ \{ P_1 \} c_1 \{ Q_1 \} \quad \{ P_2 \} c_2 \{ Q_2 \} } { \{ P_1 \wedge P_2 \} c_1 \cup c_2 \{ Q_1 \vee Q_2 \} }$

$\displaystyle\frac{ [ P_1 ] c_1 [ Q_1 ] \quad [ P_2 ] c_2 [ Q_2 ] } { [ P_1 \vee P_2 ] c_1 \cup c_2 [ Q_1 \vee Q_2 ] }$

$\displaystyle\frac{ \langle P_1 \rangle c_1 \langle Q_1 \rangle \quad \langle P_2 \rangle c_2 \langle Q_2 \rangle } { \langle P_1 \wedge P_2 \rangle c_1 \cup c_2 \langle Q_1 \wedge Q_2 \rangle }$

Here are some rules for conjoining or disjoining two different specifications for the same command:

$\displaystyle\frac{ \{ P_1 \} c \{ Q_1 \} \quad \{ P_2 \} c \{ Q_2 \} } { \{ P_1 \vee P_2 \} c \{ Q_1 \vee Q_2 \} }$

$\displaystyle\frac{ \{ P_1 \} c \{ Q_1 \} \quad \{ P_2 \} c \{ Q_2 \} } { \{ P_1 \wedge P_2 \} c \{ Q_1 \wedge Q_2 \} }$

$\displaystyle\frac{ [ P_1 ] c [ Q_1 ] \quad [ P_2 ] c [ Q_2 ] } { [ P_1 \vee P_2 ] c [ Q_1 \wedge Q_2 ] }$

$\displaystyle\frac{ \langle P_1 \rangle c \langle Q_1 \rangle \quad \langle P_2 \rangle c \langle Q_2 \rangle } { \langle P_1 \wedge P_2 \rangle c \langle Q_1 \wedge Q_2 \rangle }$

Here are some ways to translate between Hoare triples and Incorrectness triples. (In what follows, let $determ(c)$ be the set of states from which execution of $c$ is deterministic, and let $c^{-1}$ denote the inverse of the command $c$ .)

If $P \cap diverges(c) = \emptyset$ then $\{ P \} c \{ Q \}$ implies $[Q] c^{-1} [P]$ .
If $P \subseteq determ(c)$ then $[Q] c^{-1} [P]$ implies $\{ P \} c \{ Q \}$ .
Hence, if $P \cap diverges(c) = \emptyset$ and $P \subseteq determ(c)$ then $\{ P \} c \{ Q \}$ coincides with $[Q] c^{-1} [P]$ .

Here are some ways to translate between Hoare triples and the new form of triples.

If $P \cap diverges(c) = \emptyset$ then $\{ P \} c \{ Q \}$ implies $\langle \neg P \rangle c \langle \neg Q \rangle$ .
If $P \subseteq determ(c)$ then $\langle \neg P \rangle c \langle \neg Q \rangle$ implies $\{ P \} c \{ Q \}$ .
Hence, if $P \cap diverges(c) = \emptyset$ and $P \subseteq determ(c)$ then $\{ P \} c \{ Q \}$ coincides with $\langle \neg P \rangle c \langle \neg Q \rangle$ .

If we become a bit more concrete now, and model program states simply as functions from variables to integers, then we can obtain specifications of the “increment $x$ ” command like so:

$\{ x = n \}\,{+}{+}x \,\{x = n + 1\}$
$[ x = n ] \, {+}{+}x \, [ x = n + 1 ]$
$\langle x = n \rangle \, {+}{+}x \, \langle x = n + 1 \rangle$

or:

$\{ x > n \} \, {+}{+}x \, \{x > n + 1\}$
$[ x > n ] \, {+}{+}x \, [ x > n + 1 ]$
$\langle x > n \rangle \, {+}{+}x \, \langle x > n + 1 \rangle$

For specifying an “assign a constant” command, we can use:

$\{ x > 2 \} \, x := 5 \, \{ x = 5 \}$
$[ x > 2 ] \, x := 5 \, [ x = 5 ]$

but $\langle x > 2 \rangle \, x := 5 \, \langle x = 5 \rangle$ is invalid because $x>2$ is not necessary to guarantee $x = 5$ after the assignment. In fact, any initial state will suffice here, which means $\langle True \rangle \, x := 5 \, \langle x = 5 \rangle$ is valid.

We can also specify that assignment command more loosely:

$\{ x > 2 \} \, x := 5 \, \{ x > 3 \}$

but $[ x > 2 ] \, x := 5 \, [ x > 3 ]$ is invalid because executing $x := 5$ can never reach a final state where $x = 7$ , say. And $\langle x > 2 \rangle \, x := 5 \, \langle x > 3 \rangle$ is invalid too, for the same reason as given above.

Finally, we can specify some non-deterministic commands, such as a command that assigns to $x$ either 4 or 5:

$\{ True \} \, (x := 4) \cup (x := 5) \, \{ x = 4 \vee x = 5 \}$
$[ True ] \, (x := 4) \cup (x := 5) \, [ x = 4 \vee x = 5 ]$
$\langle True \rangle \, (x := 4) \cup (x := 5) \, \langle x = 4 \vee x = 5 \rangle$

or a command that either increments $x$ or does nothing:

$\{ x > n \} \, ({+}{+}x) \cup \text{skip} \, \{ x > n \}$
$[ x > n ] \, ({+}{+}x) \cup \text{skip} \, [ x > n ]$
$\langle x > n \rangle \, ({+}{+}x) \cup \text{skip} \, \langle x > n \rangle$

◆

In terms of related work:

O’Hearn does mention the curious inequivalence between $P \supseteq pre(c,Q)$ and $post(P,c) \supseteq Q$ in his POPL’19 paper (near the end of Section 7), but doesn’t dig into the implications of building a proof system around triples that mean $P \supseteq pre(c,Q)$ .
He does, however, point out some earlier work by Cousot et al. that made use of these triples in an automated verification context, which was published in VMCAI 2013. Cousot et al. referred to them as necessary preconditions, and they used them because they wanted to avoid false positives in their automatic tool. However, they didn’t develop a set of proof rules for reasoning about these triples.
More recently, Zilberstein, Dreyer, and Silva’s OOPSLA 2021 paper proposes Outcome logic as a way to unify $\{P\}\,c\,\{Q\}$ triples and $[P]\,c\,[Q]$ triples into a single logic based on monads and monoids. I haven’t yet figured out whether Outcome logic also covers $\langle P\rangle \,c\,\langle Q\rangle$ triples.

To conclude: $\langle P\rangle \,c\,\langle Q\rangle$ gives rise to a proof system that seems to have some interesting properties, and perhaps it shines some light on the nature of Hoare logic and Incorrectness logic by where it differs from them. I’m not convinced that it’s a useful proof system like Hoare logic or Incorrectness logic certainly are. Still, it’s always satisfying to fill in “gaps”!

An Isabelle-2023 proof script that formalises all of the results given above is available.

http://johnwickerson.wordpress.com/?p=2305

Extensions

CXL: What’s all the fuss about?

John Wickerson Aug 24, 2023

The world of computer architecture is quite excited these days about something called Compute Express Link (CXL). It’s a new standard that allows the various components of a datacentre computer to communicate large amounts of data efficiently with each other. In this article, I’ll explain what CXL is, and why there is so much excitement… Continue reading CXL: What’s all the fuss about? →

Show full content

The world of computer architecture is quite excited these days about something called Compute Express Link (CXL). It’s a new standard that allows the various components of a datacentre computer to communicate large amounts of data efficiently with each other.

In this article, I’ll explain what CXL is, and why there is so much excitement about the possibilities that it opens.

I’ll start by talking about the situation before CXL came along.

The Way Things Are

We are in the age of parallelisation and specialisation. That is, it’s no longer practical to push all our computation through a single processor that runs very quickly; instead, we have to spread our computation out across several processors, all running in parallel. These various processors need not be identical; indeed, there are advantages to having different kinds of processors, all specialised for different tasks, and arranging that the overall computation is split up so as to play to the strengths of the available processors. For instance, a CPU might be well suited for handling user interaction, while the task of iterating through a large array might be handed off to a GPU.

Of course, now that we have spread out our processors in this way, we need to ensure that the data that our processors need is in the right place at the right time. And since we are also in the age of Big Data, we have a lot of data to move around.

It’s not practical to keep all data in a single memory — the delay associated with each read or write would be intolerable. CPUs tackle this problem by using caches: locations that are likely to be needed soon are copied into small ‘cache’ memories that are fast to access. Cache-coherence protocols are employed to ensure that data in those copies remains in-sync with the contents of the main memory.

Cache-coherence works nicely within a single CPU, but what about our heterogeneous system made up of CPUs, GPUs, and other devices?

Let’s consider a simple heterogeneous system made up of a CPU and a GPU, as shown in the picture below.

There are a couple of ways for the GPU to access data in the CPU’s memory. The traditional way is to copy data en-masse from the CPU’s memory into the GPU’s memory, and then to copy it back once the computation is complete. But this can involve an intolerable amount of data movement, much of which may be unnecessary.

An alternative is to give the GPU direct access to the locations in the CPU’s memory that it actually needs to read or write. This can reduce the amount of data flowing between the processors, but each individual memory access by the GPU now becomes very slow, because it has to go all the way to the CPU’s memory to get the freshest data. It cannot rely on copies of CPU memory locations that it may have stored in its own caches, because they may contain out-of-date values. After all, as shown in the picture above, the GPU is not part of the CPU’s cache-coherence domain, so it is not notified if the CPU overwrites a memory location that the GPU has cached.

The usual workaround is to ensure that the CPU does not access any of the memory locations to which the GPU has access until the GPU has finished with them. Naturally, this restricts what the CPU is able to do while the GPU is performing its task.

Enter CXL

The key idea of CXL is to extend the cache-coherence domain so that it covers not just the CPU, but all of the connected processors too. This gives us the following picture.

Why is this a good thing? Well, the CPU and the GPU can now cache any location in the CPU’s memory, safe in the knowledge that the cache-coherence protocol will notify them if the cached data goes out-of-date. There is no longer any need for the CPU to avoid memory locations that the GPU has been given access to.

So why wasn’t this done before? Well, it kinda was. For instance, AMD developed Infinity Fabric, which is used to provide a cache-coherent interconnect between AMD CPUs and AMD GPUs on Frontier, the fastest supercomputer in the world since 2020. Before that, NVIDIA developed NVLink, which did the same for IBM POWER9 CPUs and NVIDIA GPUs in Summit, the fastest supercomputer in the world before 2020. OpenCAPI is another cache-coherent interconnect protocol for IBM CPUs. The problem is that all of these protocols are vendor-specific. It is not possible to connect, for instance, an AMD CPU and an NVIDIA GPU, because the cache-coherence mechanisms used by one device are not meaningful to the other device.

One attempt that came close to being a vendor-independent standard was CCIX (Cache-Coherent Interconnect for Accelerators), which arose in the mid-2010s. It had the backing of several big players in the industry (AMD, Xilinx, Huawei, and Arm) but lacked broader support, most notably from Intel, and is now considered dead. CXL was proposed by Intel in 2019, and quickly attained broad support.

A Killer Application: Memory Pooling

As I’ve described above, CXL promises to make it easier for CPUs, GPUs, and other devices to communicate large amounts of data efficiently with each other.

But the CXL application that has the most industry buzz around it is called memory pooling.

Given that CXL can be used to allow devices to quickly access each other’s memory, the question arises: can we use CXL to connect devices that solely contain memory?

The picture above is enticing to a datacentre operator, because it shows how memory and processors can be disentangled. In our previous setup, each processor had a fixed amount of memory attached to it. This fixed amount had to be rather large, in order to suffice for the most memory-intensive workloads that the processor handles. But large amounts of memory are expensive, and anyway, most of that memory is likely unused most of the time.

But once we have CXL-connected memory, then this memory can be used by any processor that is also CXL-connected. It can be allocated to one processor for a while, then returned to the ‘pool’, and then allocated to a different processor that needs it. This can result in the system requiring far less memory attached to each processor, and hence far less memory overall, and hence substantial cost savings for the datacentre operator.

Sources and Further Reading

http://johnwickerson.wordpress.com/?p=2283

Extensions

The fastest route between King’s Cross St Pancras and South Kensington

John Wickerson May 11, 2023

I’ve commuted via King’s Cross St Pancras to Imperial College’s South Kensington Campus for about a decade now, but have only recently discovered what I believe to be the fastest possible route between those two London stations. King’s Cross St Pancras to South Kensington takes me about 20 minutes, and the return journey takes about… Continue reading The fastest route between King’s Cross St Pancras and South Kensington →

Show full content

I’ve commuted via King’s Cross St Pancras to Imperial College’s South Kensington Campus for about a decade now, but have only recently discovered what I believe to be the fastest possible route between those two London stations. King’s Cross St Pancras to South Kensington takes me about 20 minutes, and the return journey takes about 17.

From King’s Cross St Pancras to South Kensington

I’ll assume you’re standing outside King’s Cross railway station. Enter the Underground station via the entrance shown in the photo below.

*Starting point outside King’s Cross railway station.*

There are two things you need to know about the King’s Cross St Pancras Underground station: one is that they built a new ticket hall in 2009 (the ‘northern’ ticket hall), and the other is that you should never use it. Wherever you’re coming from and wherever you’re going, if you find yourself in the northern ticket hall then you’ve gone the long way round.

So although you’re aiming for the Victoria line, you shouldn’t actually follow the signs that tell you to go right for the Victoria line. They’ll take you via the northern ticket hall. Instead, go left.

*The sign says go right for the Victoria line, but you should go left.*

This will take you to another ticket hall (the ‘Tube’ ticket hall). From this point, you can safely follow the signs for the Victoria line (southbound). Once you step onto the platform, don’t be tempted to move; you’ll want to get on the train at its frontmost door.

*Stand here on the southbound Victoria line platform.*

Get off the train at Victoria. You are now aiming for the District and Circle lines. In front of you is a sign inviting you to walk back along the platform if you want the District and Circle lines. It is a lie. Ignore it, and follow the signs for National Rail instead.

You will soon see another sign insisting that you are on a route with no access to the District and Circle lines. It is also a lie. Ignore it, and proceed up the escalator in front of you.

At the top of the escalator, turn left, then left again to go down another escalator. You can now follow signs for the District and Circle lines (westbound).

*Follow the signs for the District and Circle lines (westbound).*

You’ll want to board the train at the front of the third car. This requires you to walk left along the platform until the ground under your feet looks like the photo below.

*Stand here on the westbound District and Circle lines platform.*

When you get off the train at South Kensington you should find the stairs directly in front of you. (Unfortunately this station has no step-free exit.)

From South Kensington to King’s Cross St Pancras

Starting at the ticket hall of the South Kensington Underground station, take the stairs down to the District and Circle lines platform. You’re heading eastbound, so keep left at the bottom of the stairs. Keep walking straight ahead, until you’re most of the way along the platform. You’ll want to board the train at the back of the second car, which means you should stand outside the green door of the Customer Service Manager’s office.

*Stand here on the eastbound District and Circle lines platform.*

Get off the train at Victoria and follow the signs for the Victoria line (northbound). When you step onto the platform you’ll want to turn left and head all the way down to the far end, so you can board the train at the very rear. This puts you in the right position to make a quick exit at King’s Cross St Pancras, being sure – as always – to avoid that northern ticket hall!

http://johnwickerson.wordpress.com/?p=2254

Extensions

Hyphenation: when and why?

John Wickerson Jan 27, 2023

When writing multi-word adjectives, hyphens can be important. But many writers use either too many or too few. This blog post is my attempt to clarify the situation (as I understand it). The basic issue is that “adjective noun noun” can be hard for readers to parse. The default is to understand it as (adjective… Continue reading Hyphenation: when and why? →

Show full content

When writing multi-word adjectives, hyphens can be important. But many writers use either too many or too few. This blog post is my attempt to clarify the situation (as I understand it).

The basic issue is that “adjective noun noun” can be hard for readers to parse. The default is to understand it as (adjective (noun noun)). For instance, “benign data race” would usually be read as “a data race that is benign” rather than “a race on benign data”.

So, if you want to indicate the other parsing, ((adjective noun) noun), you should add a hyphen. For instance, “shared-memory operation” is parsed as “an operation on shared memory” rather than “a memory operation that is shared”.

Note that “adverb adjective noun” doesn’t have this problem, because (adverb (adjective noun)) isn’t a valid parsing. For instance, a “weakly consistent system” has to be a “system that is weakly consistent” because a “consistent system that is weakly” doesn’t make grammatical sense. So there’s no need for hyphenation here.

Featured photo by Aaron Burden on Unsplash.

http://johnwickerson.wordpress.com/?p=2249

Extensions

Proving the existence of God using a computer

John Wickerson Dec 5, 2022

My friend and colleague George Constantinides wrote an interesting post on his blog recently where he formalises St Anselm’s argument for the existence of God. Since I’m teaching a course on the Isabelle proof assistant this term, I thought I’d see if I could recreate that argument using Isabelle. Here goes… As can be seen… Continue reading Proving the existence of God using a computer →

Show full content

My friend and colleague George Constantinides wrote an interesting post on his blog recently where he formalises St Anselm’s argument for the existence of God. Since I’m teaching a course on the Isabelle proof assistant this term, I thought I’d see if I could recreate that argument using Isabelle.

Here goes…

theory God imports Main begin

(* We assume a fixed set of "beings". *)
typedecl Being

theorem 
  (* Some beings exist in reality. The others
  exist only in the mind. *)
  fixes exists_in_reality :: "Being set"

  (* Suppose that the idea of "greatness" imposes
  a (strict) partial order on the set Being. That is, 
  some beings are greater than other beings. The 
  order is only partial, which means that it is 
  not always possible to compare the greatness of
  two beings. NB: strict partial order = 
  transitive + irreflexive. *)
  fixes greater_than :: "Being rel"
  assumes "trans greater_than"
  assumes "irrefl greater_than"

  (* We shall say that a being is godlike if
  there is no being that is greater than it. *)
  fixes godlike :: "Being set"
  assumes "∀b ∈ godlike. 
    ∄b'. (b',b) ∈ greater_than"

  (* The concept of god exists in the mind and 
  so is a member of Being. *)
  assumes "godlike ≠ {}"

  (* If two beings are identical except that one 
  exists in reality and one does not, then the 
  one that exists in reality is greater than the 
  one that doesn't. I've formalised a weaker 
  version: for every being that does not exist 
  in reality, there exists another being that 
  does exist in reality, and is greater. *)
  assumes "∀b. b ∉ exists_in_reality ⟶ 
    (∃b' ∈ exists_in_reality. 
      (b',b) ∈ greater_than)"

  (* God exists. *)
  shows "∃g ∈ godlike. g ∈ exists_in_reality"
  using assms by blast

end

As can be seen by the brevity of the final “using assms by blast” line, Isabelle didn’t have much trouble accepting this argument once all of the premises were laid out precisely. Then again, as for whether all of those premises are reasonable, I’m not so sure! I think St Anselm was taking some liberties in his original version, and I’m sure I’ve introduced some oversimplications in my own formulation.

Perhaps this is a theorem where a constructive proof would be more convincing…

http://johnwickerson.wordpress.com/?p=2228

Extensions

“A Squash and a Squeeze” – Academic Edition

John Wickerson Jul 14, 2022

A little old lady worked all on her own On a paper about a result she had shown. A wise old man heard her grumble and rage: "There’s not enough room on my page! Wise old man, won’t you help me, please? My paper’s a squash and a squeeze."… Continue reading “A Squash and a Squeeze” – Academic Edition →

Show full content

With apologies to Julia Donaldson and Axel Scheffler

A little old lady worked all on her own
On a paper about a result she had shown.
A wise old man heard her grumble and rage:
“There’s not enough room on my page!
Wise old man, won’t you help me, please?
My paper’s a squash and a squeeze.”

“Bring in more proofs,” said the wise old man.
“Bring in more proofs? What a curious plan.”
So she added a proof of each lemma in turn
But found that this led to increasing concern.
The little old lady cried, “God in his heaven!
My paper was ten pages; now it’s eleven!
Printing this out will use twenty-three trees!
My paper’s a squash and a squeeze!
Wise old man, won’t you help me, please?
My paper’s a squash and a squeeze.”

“Bring in more graphs,” said the wise old man.
“Bring in more graphs? What a curious plan.”
So she added some bar charts and one scatter plot
But found that this lengthened her work quite a lot.
The little old lady cried, “Though I have tried,
My paper’s now entered its seventeenth side!
Should I remove all the words that have e’s?
My paper’s a squash and a squeeze!
Wise old man, won’t you help me, please?
My paper’s a squash and a squeeze.”

“Bring in more quotes,” said the wise old man.
“Bring in more quotes? What a curious plan.”
So she added a quote and it made her work stronger,
But, lo and behold, her paper got longer!
The little old lady cried, “Please help me trim it,
I need to get down to my editor’s limit.
I added quotations and, well, quelle surprise:
My paper’s a squash and a squeeze!
Wise old man, won’t you help me, please?
My paper’s a squash and a squeeze.”

“Take them all out,” said the wise old man.
“But then I’ll be back where I first began!”
So she made an appendix for all of her proofs
(Which anyway probably had a few goofs),
She chopped out the graphs and replaced them with tables
(She never did like all those x-axis labels),
She took out the quotes, and she had a good stare:
“I’m under the limit with four lines to spare!”
“Thank you, old man, for your canny suggestions!
Your strategy certainly raised a few questions,
But worked like a treat. So when I next edit
My paper I’ll give you appropriate

Credit.”

http://johnwickerson.wordpress.com/?p=2214

Extensions

Some reflections on FCCM 2022

John Wickerson May 18, 2022

FCCM 2022 in New York City has just drawn to a close, so I thought I’d put down some thoughts about my experience while it’s fresh in my mind. As my first in-person conference since covid, my expectations were sky high, and I’m very pleased to report that my expectations were well and truly met… Continue reading Some reflections on FCCM 2022 →

Show full content

FCCM 2022 in New York City has just drawn to a close, so I thought I’d put down some thoughts about my experience while it’s fresh in my mind. As my first in-person conference since covid, my expectations were sky high, and I’m very pleased to report that my expectations were well and truly met – I’ve had a fantastic time. It was really wonderful to re-connect with colleagues and collaborators from across the world, and to make new connections too. I really enjoyed the highly interactive “FLASHLIGHT” workshop on formal methods and high-level synthesis that I co-organised as a satellite event. The main conference had an interesting and varied programme, and it was so nice to be able to engage in direct conversation with the authors both in the Q&As after the talks and in the coffee breaks afterwards. Here is a handful of papers that have stuck in my mind:

Aman Arora presented a new design of FPGA in which block RAMs are augmented with their own built-in processing elements. The processing elements are not particularly exciting in themselves – for instance, they can only manage to process two or three bits at a time – but they have fantastically low latency because the data doesn’t need to travel across the FPGA fabric, from the RAM to the logic blocks and back again. An interesting challenge for the hardware compilers of the near future will be to work out which parts of the computation should be mapped to these low-latency, low-throughput “CoMeFa” RAMs, and which parts should remain in the traditional logic blocks.
Ecenur Üstün explained that there are several ways to decompose high-bitwidth integer multiplications into multiple lower-bitwidth operations, and presented her tool for exploring that design space using a technique called “equality saturation”.
Lana Josipović explained how to add resource sharing to her dynamically-scheduled HLS tool, in order to make the circuits it produces consume less area on the FPGA fabric. This is easy in the static-scheduling setting because you can see at compile-time that a given pair of operations are never executed simultaneously and hence can be time-multiplexed onto the same functional unit; in the dynamic-scheduling setting, this information doesn’t become apparent until run-time, which is too late!
Funnily enough, my student Michalis Pardalos‘s presentation was also about adding resource sharing to an HLS tool; on the one hand, his setting is easier than Lana’s because he assumes static scheduling, but on the other hand, his setting was more challenging because he is working on a computer-checked mathematical proof that the resource sharing is being carried out correctly.
Another of my students, Jianyi Cheng, showed how to extend HLS with a technique called C-slow pipelining, which can be thought of as a hardware pipeline with a cycle in it. An audience member expressed appreciation for Jianyi’s carefully crafted Powerpoint animations of these cyclic pipelines in action, which was heartening to see.
Nicholas Beckwith explained that HLS tools tend to generate designs with quite “spread out” memory access patterns; this is fine for ordinary RAM, but leads to strikingly poor performance when using persistent memory like Intel’s Optane DC, which accesses memory at the granularity of a “block” at a time. Nicholas showed how performance can be regained by re-mapping memory locations so as to minimise the number of blocks that need accessing.

FCCM 2022 was the first “hybrid” conference I’ve attended. Roughly half of its 275 participants were in-person and half were online. As for the talks themselves: about 15 were in-person and 10 were delivered live over Zoom. (One of the online talks was actually meant to be in-person, but the speaker self-quarantined upon arrival in NYC after coming down with flu-like symptoms.) There were just a handful of questions from remote participants throughout the whole conference, each of which was read out by the session chair.

It was notable that our venue (Cornell Tech) was extremely well equipped for a hybrid conference, with top-of-the-range sound systems, oodles of microphones of all shapes and sizes, lecterns with built-in Zoom support, and high-resolution cameras pointing at the speaker and at the audience. The system was fantastically complicated and required seemingly constant attention from the small army of hardworking volunteers. I suspect that a smaller-scale conference at a less modern venue would struggle.

The hybridisation was clearly an enormous effort for the organisers, but I was struck by how little “overhead” the hybridisation imposed on the in-person participants. We had to remember to speak into the microphone during the Q&As so that the remote participants could hear us, but other than that, there was barely any indication that this was a hybrid conference.

http://johnwickerson.wordpress.com/?p=2199

Extensions

The “Cluedo” effect in randomised testing

John Wickerson Mar 30, 2022

When playing the board game Cluedo (also known as Clue in North America), you make a series of hypotheses that consist of a murderer, a weapon, and a room. In each hypothesis, the room has to be the room your piece is currently in, and the player to your left always has the first opportunity… Continue reading The “Cluedo” effect in randomised testing →

Show full content

When playing the board game Cluedo (also known as Clue in North America), you make a series of hypotheses that consist of a murderer, a weapon, and a room. In each hypothesis, the room has to be the room your piece is currently in, and the player to your left always has the first opportunity to invalidate your hypothesis by showing you one of their cards.

So, it’s quite infuriating when you discover that the player to your left owns the card of the room you’re in, because once they’ve shown you that card, you’re not going to get any more information — either from them or from any other player — until you’ve moved your piece out of that room. That’s because even if you change the murderer or the weapon in your hypothesis, they’re (presumably) just going to keep showing you their room card.

The situation is much the same with randomised testing, or fuzzing.

Once you’ve found one bug in the software-under-test, it’s often the case that you’ll keep finding that same bug again and again, even with different random test-cases, especially if the bug is in a commonly used feature. This phenomenon, which has been written about by Hughes et al. among others, can mask any other bugs that might be present. In this situation, the analogue to “moving your Cluedo piece out of the room” is either: (1) identifying the buggy feature and making sure that subsequent test-cases avoid it, or (2) fixing the bug in the software-under-test before proceeding.

Thanks to David MacIver for pointing me to Hughes et al.’s paper.

Video game cassette, u2018Cluedo: The great detective gameu2019 (video game)

http://johnwickerson.wordpress.com/?p=2189

Extensions

High-level synthesis, but correct

John Wickerson Aug 27, 2021

This post is about a paper by Yann Herklotz, James Pollard, Nadesh Ramanathan, and myself that will be presented shortly at OOPSLA 2021. High-level synthesis (HLS) is an increasingly popular way to design hardware. It appeals to software developers because it allows them to exploit the performance and energy-efficiency of custom hardware without having to learn… Continue reading High-level synthesis, but correct →

Show full content

This post is about a paper by Yann Herklotz, James Pollard, Nadesh Ramanathan, and myself that will be presented shortly at OOPSLA 2021.

High-level synthesis (HLS) is an increasingly popular way to design hardware. It appeals to software developers because it allows them to exploit the performance and energy-efficiency of custom hardware without having to learn Verilog, and it appeals to hardware engineers because it offers them the convenient abstractions of high-level software languages like C.

However, there are concerns about its reliability. For instance, Andrew Canis, co-founder of LegUp Computing, has described the HLS process as “inherently prone to introducing bugs or regressions in the final circuit functionality”. Indeed, our own recent work gave credence to those concerns by uncovering several instances where mainstream HLS tools really do compile C programs into incorrect Verilog designs. This lack of reliability undermines the usefulness of HLS in safety-critical contexts.

To address this problem, we have built a new open-source HLS tool, called Vericert, that is mechanically proven correct. This means that whenever Vericert compiles a C program into a Verilog design, the Verilog design is guaranteed to behave in exactly the same way as the C program. To achieve this, we extended an existing verified C compiler called CompCert with a new hardware-specific intermediate language and a new Verilog backend. To phrase Vericert’s correctness theorem, we rely on a Verilog semantics developed by Lööw and Myreen, which we lightly extended to handle a few more language features that we needed. Vericert currently supports most C constructs, including integer operations, function calls (all inlined), local arrays, structs, unions, and general control-flow statements, but currently excludes support for case statements, function pointers, recursive function calls, non-32-bit integers, floats, and global variables.

We used the PolyBench/C benchmark suite to evaluate the performance of Vericert in comparison to an existing open-source HLS tool called LegUp. Area-wise: Vericert fares quite well, generating designs that are only around 10% larger than those generated by LegUp. Speed-wise: we initially found Vericert’s designs to run about 27x slower (!) than LegUp’s. This can mostly be blamed on Vericert’s naive strategy of cramming each operation into a single clock cycle, because whenever complicated operations like divisions are present, the clock has to run really slowly to give them time to complete. After replacing all the divisions in PolyBench/C with iterated shifts, we found Vericert’s designs to be about 2x slower than LegUp’s, which is much less objectionable. The remaining gap can be largely attributed to Vericert’s current lack of support for performing operations in parallel, and that’s a feature we’re working to add at the moment.

Of course, the main selling point of Vericert is its watertight guarantee of correctness, and this is something that we have evaluated empirically too. We used Csmith to generate thousands of random C programs and fed each one into Vericert. When we did this with other HLS tools, they tended to miscompile a few percent of the test programs, but we found that Vericert never produced an incorrect output (once we fixed a little bug in our pretty-printer, which is unverified). To be clear: Vericert quite frequently failed to produce any output at all, due to its lack of support for a few C language features that appeared in the test programs, but it never produced incorrect output.

Vericert is fully open-source and available on Github.

http://johnwickerson.wordpress.com/?p=2183

Extensions

What’s the Rush?

John Wickerson Aug 17, 2021

Originally posted on Thinking:
At FPL 2021, my PhD student Jianyi Cheng (jointly supervised by John Wickerson) will present our short paper “Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining for HLS”. In this post, I explain the simple idea behind this paper and how it can significantly accelerate certain neglected corner…

Show full content

Thinking

At FPL 2021, my PhD student Jianyi Cheng (jointly supervised by John Wickerson) will present our short paper “Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining for HLS”. In this post, I explain the simple idea behind this paper and how it can significantly accelerate certain neglected corner cases in high-level synthesis (HLS).

By far the most significant way to extract high performance from a hardware accelerator in high-level synthesis is to use loop pipelining. Loop pipelining is the idea of starting the next iteration of a loop before the previous one finishes, allowing multiple iterations to be executing simultaneously. However, some loop iterations may need a result produced by earlier loop iterations, limiting the extent to which this can be done. HLS tools generally determine a ‘safe’ initiation interval – the number of clock cycles between starting two adjacent loop iterations – and…

View original post 349 more words

http://johnwickerson.wordpress.com/2021/08/17/whats-the-rush/

Extensions

Understanding the memory semantics of multi-threaded CPU/FPGA programs

John Wickerson Jul 3, 2021

If you’ve ever attended a seminar about weak memory models, chances are good that you’ve been shown a small concurrent program and asked to ponder what is allowed to happen if its threads are executed on two or three different cores of a multicore CPU. For instance, you might be shown this program: and asked… Continue reading Understanding the memory semantics of multi-threaded CPU/FPGA programs →

Show full content

If you’ve ever attended a seminar about weak memory models, chances are good that you’ve been shown a small concurrent program and asked to ponder what is allowed to happen if its threads are executed on two or three different cores of a multicore CPU. For instance, you might be shown this program:

// Thread 1:   ||  // Thread 2:
x := 1;        ||  y := 1;
r0 := y;       ||  r1 := x;

and asked what values can end up in the registers r0 and r1 if the program is executed on a multicore x86 machine, assuming that the shared locations x and y are both initially set to 0. The speaker might enjoy informing audience members unfamiliar with the intricacies of x86 memory semantics that the seemingly impossible outcome r0=r1=0 is actually allowed.

In a paper that Dan Iorga, Tyler Sorensen, Ally Donaldson and I will shortly present at OOPSLA 2021, we ask a slightly different question: what is allowed to happen if one of those threads is executed by a CPU core, and the other is simultaneously executed by a custom hardware design implemented on an FPGA?

That’s not quite as contrived a situation as you might think. In our era of heterogeneity, computing is increasingly being done by systems-on-chip that are made up of several different computing components (conventional CPUs, powerful GPUs, and programmable FPGAs), all with the potential to work together on a problem, each tackling the part that they are best suited to.

So what might happen if we run that program on a system-on-chip such as Intel’s recently-released Xeon+FPGA processor, which packs a multicore Xeon CPU and an Arria FPGA onto a single chip?

// Thread 1 (CPU):   ||  // Thread 2 (FPGA):
x := 1;              ||  y := 1;
r0 := y;             ||  r1 := x;

There are several reasons why weird behaviours, like the r0=r1=0 one we saw earlier, might still be allowed to happen. For one, the write buffers on the x86 CPU might cause the write of x to be delayed until after the read of y. For another, the FPGA might allow similar reordering; indeed, on the particular system-on-chip that we studied, memory access requests by the FPGA can be batched up and reordered before being dispatched. For a third, there are multiple channels that link the FPGA to the memory that it shares with the CPU, and memory requests in one channel might overtake those in another channel.

And there are other examples of concurrent programs where the CPU/FPGA combo can yield behaviours even more weird than those allowed on the CPU alone.

In our work, we’ve designed a mathematical model that explains which memory accesses by the CPU and by the FPGA can get reordered, and which ones can’t. We’ve validated our model by checking that it lines up with the available documentation from Intel, by talking it through with some Intel engineers who designed parts of the chip, and by generating and running hundreds of little concurrent programs as test-cases on a real machine (access to which was kindly provided to us by the Intel Labs Academic Compute Environment).

Our work shows how we can begin to reason about the correctness of small concurrent programs for these systems-on-chip. For instance, we use an encoding of our model in the CBMC model checker to confirm that a queue datastructure implemented between the CPU and the FPGA is implemented correctly. On the flip side, we’ve also conducted some experiments that show what can happen if this datastructure is implemented incorrectly – that is, we’ve shown that if some synchronisation operations are omitted, then the queue can indeed lose or duplicate some elements. (Which might not be the end of the world if you’re into approximate computing.)

Ultimately, we see this work as an important step towards providing a firm foundation for reasoning about programs that run on, and compilers that target, the hybrid computing devices that are fast becoming a central part of our computing landscape.

http://johnwickerson.wordpress.com/?p=2172

Extensions

Introducing C4: the C Compiler Concurrency Checker

John Wickerson Jun 9, 2021

This post is about a paper by Matt Windsor, Ally Donaldson, and myself that will be presented shortly at the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) conference, in the tool demonstrations track. Compilers are a central pillar of our computing infrastructure, so it’s really important that they work correctly. There’s been… Continue reading Introducing C4: the C Compiler Concurrency Checker →

Show full content

This post is about a paper by Matt Windsor, Ally Donaldson, and myself that will be presented shortly at the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) conference, in the tool demonstrations track.

Compilers are a central pillar of our computing infrastructure, so it’s really important that they work correctly. There’s been a lot of work in the last decade or so on testing compilers by feeding them randomly generated programs and checking that they are compiled correctly. But most of this work has been limited to single-threaded (sequential) programs, which leaves open the question of whether compilers compile multi-threaded (concurrent) programs correctly.

There’s a good reason why existing work concentrates on the single-threaded case. The typical way to check that a randomly generated program has been compiled correctly is to compile it twice – whether that’s with two different compilers, or with two different sets of compiler flags, or before and after applying some semantics-preserving transformations – and see whether there’s any discrepancy between the two outputs. If there is, then one of the compilers must be doing something wrong! But this reasoning only works if every randomly generated program is deterministic – which multi-threaded programs seldom are.

So how can we expand compiler testing to the multi-threaded case? Our idea is to run the following sequence of steps in a loop:

First, we generate a small multi-threaded program that is potentially tricky. By ‘small’ we mean around five or six instructions spread across two or three threads. And by ‘potentially tricky’ we mean that the program exercises a corner-case of the notoriously complicated C concurrency model. Lustig et al. showed in ASPLOS 2017 how to use the Alloy Analyzer to automatically generate suites of such programs directly from an axiomisation of the C concurrency model, so we adopt their technique here. Importantly, each program in this suite comes with a postcondition that characterises all the final states that the program is allowed to reach.
Second, in the hope of making that small multi-threaded program more of a challenge for the compiler-under-test, we apply a series of semantics-preserving transformations to it, in the spirit of Lidbury et al.’s CLSmith tool, ending up with a large multi-threaded program. For instance, we might replace
  x := 42
with
  if (getenv("john")) {     x := 42   } else {     // lots of randomly     // generated instructions
  }
and then always run the generated code in an environment where the variable john has been set to 1. Actually we don’t need our transformations to be semantics-preserving in both directions; we just need to make sure that our transformations do not introduce new allowed outcomes (otherwise our tool might label correct behaviours as bugs). This means that we can also employ transformations like ‘randomly insert a fence’ or ‘randomly strengthen the consistency mode of an atomic operation from relaxed to release/acquire or from release/acquire to sequentially consistent‘. Importantly, because the small randomly generated program came with a postcondition, and because our transformations preserved that postcondition, we still know how our program should behave, even though it has now become rather large and unwieldy.
Third, we compile our large multi-threaded program using the compiler-under-test, with randomly chosen command-line options. If the compiler crashes, then we have found a bug.
Fourth, we execute the compiled program on a real machine. We execute the program thousands of times, in the presence of several ‘stressing’ threads that hammer on nearby memory locations in the hope of provoking weird behaviours. This hostile environment is created using the Litmus tool of Alglave et al. If the postcondition is violated on any of these runs, then we have found a bug.

This technique has been implemented in a tool called C4, which is available on GitHub. We applied C4 to several versions of four compilers (GCC, LLVM, the Intel C Compiler, and IBM XL) targetting three architectures (x86, IBM POWER9, and Arm8). C4 revealed two concurrency-related bugs in old versions of GCC, as well two previously-unknown bugs (which turned out, after test-case reduction, not to be related to concurrency) in IBM XL and GCC.

Of course, this is not as many bugs as we might have liked to have found! On the one hand, we suspect that if C4 employed more imaginative transformations, or if its testing throughput were optimised further, then it might be able to uncover more bugs, and this is a direction we’re pursuing. But on the other hand, we also have a growing suspicion that there may not be that many concurrency-related bugs in mainstream compilers in the first place. Perhaps compiler engineers make conservative choices when it comes to implementing concurrency, concerned by the inherent difficulties of testing multi-threaded code. Perhaps they do not shoot for super-aggressive optimisations like they do when compiling sequential code, and thus their code is less likely to contain the kind of faults that other compiler-testing tools have been so successful at exposing. Indeed, we experimented with injecting artificial bugs into LLVM, capturing the kinds of concurrency-related mistakes we’ve seen in other compilers, and found that C4 was effective at catching those. So, perhaps C4 will be end up being most valuable not for exposing existing bugs in compilers, but for giving compiler engineers the confidence to go ahead and implement more ambitious optimisations, safe in the knowledge that if they do introduce a bug, C4 will likely catch it.

http://johnwickerson.wordpress.com/?p=2162

Extensions

Fuzzing High-Level Synthesis Tools

John Wickerson May 7, 2021

High-level synthesis – the automatic compilation of a software program into a custom hardware design – is an increasingly important technology. It’s attractive because it lets software engineers harness the computational power and energy-efficiency of custom hardware devices such as FPGAs. It’s also attractive to hardware designers because it allows them to enter their designs… Continue reading Fuzzing High-Level Synthesis Tools →

Show full content

High-level synthesis – the automatic compilation of a software program into a custom hardware design – is an increasingly important technology. It’s attractive because it lets software engineers harness the computational power and energy-efficiency of custom hardware devices such as FPGAs. It’s also attractive to hardware designers because it allows them to enter their designs at a higher level of abstraction than Verilog or VHDL.

As such, high-level synthesis is being increasingly relied upon. But is it reliable? Yann Herklotz, Zewei Du, Nadesh Ramanathan and I have a paper next week at FCCM 2021 about our attempts to find out.

We follow a fairly well-trodden path. There have been several projects over the last decade or so that aim to test compilers by fuzzing – that is, by generating random source files, feeding each one into the compiler-under-test, and seeing whether it is compiled correctly. But these works have focused on conventional compilers that generate assembly code for processors. In our work, we apply a similar technique to HLS tools.

The ideas can be carried over without many changes – after all, HLS tools still take C programs as input, just like conventional compilers. We changed a few things: we restricted the random C generator to account for the fact that HLS tools don’t support the full C language, we added random HLS-specific directives (like: “pipeline this loop”) to the generated programs, and we tweaked the probabilities of the random generator to avoid generating programs that made the synthesis process unbearably long. To check that the HLS tool had compiled each random C program correctly, we executed its output using a Verilog simulator and compared that against the output from a executable that we compiled using GCC.

We generated 6700 random C programs and fed them to four widely used HLS tools: LegUp, Xilinx Vivado HLS, the Intel HLS Compiler, and Bambu. We found that all four tools could be made either to crash or to generate wrong code. In total, 1191 of our generated programs failed in at least one of the four tools, as shown in the picture below.

*An Euler diagram showing which of our 6700 test-cases failed in which tools. Overlapping regions mean that the same test-cases failed in multiple tools.*

We performed test-case reduction on some of the failures, and this resulted in the identification of 8 unique bugs, several of which have been confirmed by the HLS tool developers, and one of which (shown in the code snippet below) has already been fixed.

This program miscompiles in Bambu. Because the value of `b` is shifted to the right by 8, the output should be `0x100`, but Bambu generates a design that returns `0`. The increment operation on `a` appears unrelated, but is actually necessary to trigger the bug.

Finally, we compared how many bugs we could find in a few different versions of the same HLS tool. The diagram below shows the results of that experiment on three versions of Xilinx Vivado HLS. We see that the 2018.3 version of the tool has the most failures, though it is also striking that there are some test-cases that only fail in the most recent version.

This Sankey diagram tracks 3645 test-cases through three different versions of Xilinx Vivado HLS. The ribbons collect the test-cases that pass and fail together. The black bars are labelled with the total number of test-case failures per version. The test-cases that pass in all three versions are not shown.

In conclusion: this project has shown that compiler fuzzing techniques are effective at uncovering bugs in HLS tools. With HLS tools becoming increasingly relied upon, we hope that our work has motivated the need for these tools to be more rigorously engineered.

http://johnwickerson.wordpress.com/?p=2146

Extensions

A translation puzzle

John Wickerson Sep 18, 2020

Your task is to write down a sequence of English words that, after possibly moving around the spaces between the words, become their German translations. As stated, this is pretty easy: for instance, I could write down the single English word SAND which translates to the single German word SAND. Or I could be a… Continue reading A translation puzzle →

Show full content

Your task is to write down a sequence of English words that, after possibly moving around the spaces between the words, become their German translations.

As stated, this is pretty easy: for instance, I could write down the single English word SAND which translates to the single German word SAND. Or I could be a tiny bit more ambitious and write down a sequence of words like SAND PARK RADIO, which is similarly unchanged after translating into German.

But suppose we generalised away from just “English and German” and used a pair of languages that share an alphabet but don’t have any words that self-translate like that. Then the problem becomes rather harder because you might have to actually shuffle the spaces around in order to find a solution.

For instance, perhaps the dictionary looks like this:

Language 1Language 2CHEMOUSMOUSMOUSMOUSTAMOUSTACHECHETACHEMOUS

Can you come up with a sequence of words in Language 1 that, after possibly moving around the spaces, become their Language 2 translations? Here’s one:

"MOUS TA CHEMOUS MOUS MOUSTACHE"

is a sequence of words in Language 1, and if I move the spaces around I can get

"MOUSTA CHEMOUS MOUS MOUSTA CHE"

which is exactly the same sequence translated into Language 2.

Although that particular example wasn’t too hard in the end, it turns out that this is a rather hard problem to solve in general. In fact, it’s an instance of the Post Correspondence Problem, which is famously undecidable. That means that it’s impossible to write a program that is capable of taking any translation dictionary you can think of and telling you whether or not a solution exists.

Might still be fun to try, though.

http://johnwickerson.wordpress.com/?p=2114

Extensions

Diagrams for Composing Compilers

John Wickerson May 21, 2020

T-diagrams (or tombstone diagrams) are widely used by teachers to explain how compilers and interpreters can be composed together to build and execute software. In this blog post, Paul Brunet and I revisit these diagrams, and show how they can be redesigned for better readability. We demonstrate how they can be applied to explain compiler… Continue reading Diagrams for Composing Compilers →

Show full content

T-diagrams (or tombstone diagrams) are widely used by teachers to explain how compilers and interpreters can be composed together to build and execute software. In this blog post, Paul Brunet and I revisit these diagrams, and show how they can be redesigned for better readability. We demonstrate how they can be applied to explain compiler concepts including bootstrapping and cross-compilation. We provide a semantics for our redesigned diagrams, based on binary trees. Finally, we suggest how our diagrams could be used to analyse the performance of a compilation system.

Introduction

In introductory courses on compilers across the globe, students are taught about the interactions between compilers using ‘tombstone diagrams’ or ‘T-diagrams’. In this formalism, a compiler is represented as a ‘T-piece’. A T-piece, as shown below, is characterised by three labels: the source language of the compiler, the target language of the compiler, and the language in which the compiler is implemented.

A compiler from source language ‘src’ to target language ‘tgt’ implemented in language
‘imp’.

Complex networks of compilers can be represented by composing these basic pieces in two ways. Horizontal composition means that the output of the first compiler is fed in as the input to the second. Diagonal composition means that the first compiler is itself compiled using the second compiler. The pictures below give examples of both forms.

Left: Horizontal composition of T-pieces. Haskell is compiled to C using a compiler written in Java, and thence to ARM assembly using another compiler that is also written in Java. Right: Diagonal composition of T-pieces. An OCaml-to-x86 compiler written in C is being compiled using a C-to-x86 compiler written in x86 assembly.

In this blog post, we investigate the foundations of these diagrams. What are the rules that govern how they can be composed? Can we redesign them to make them more understandable? What do they mean in general? And what do they tell us about the compilers they depict?

Background

The usefulness of diagrams for explaining the connectivity between compilers can be traced back at least as far as the UNCOL project in the 1950s, which was an early effort to provide a ‘Universal Computer-Oriented Language’ that could interface with many different front-ends and back-ends. Diagrams such as the one reproduced below were used to show how compilers producing UNCOL could be composed with those that consume it.

A precursor to the T-diagram. An UNCOL-to-704 compiler (top left) running on an IBM 704 machine (middle) is used to compile an OTN-to-UNCOL compiler implemented in UNCOL (bottom left), which results in an OTN-to-UNCOL compiler implemented in 704 machine code (right).

Shortly thereafter, T-diagrams were proposed by Bratman as an improvement on the UNCOL diagrams. As an example, the diagram below shows Bratman’s reimagining of the diagram above.

The first T-diagram. This T-diagram depicts the same information as the previous diagram.

Three problems with T-diagrams are immediately apparent.

When two T-pieces are diagonally composed, they meet at two interfaces, and it is not clear that only one of these interfaces is required to have matching languages. Diagonal composition only requires the ‘implementation’ language of the left piece to match the ‘source’ language of the right piece.
T-diagrams typically do not distinguish the operands to the composition from the result of the composition. For example, in Bratman’s picture above, the right T-piece is actually the result of composing the left T-piece with the middle T-piece, but this relationship is not made clear by the diagram – all three pieces appear on an equal footing. (Of course, the obvious way to avoid this problem is simply not to show the result of the composition in the same diagram.)
The symmetric shape of the T-pieces invites a third mode of composition, pictured below, that is not actually meaningful. (Note that this form of composition does seem to appear in Bratman’s picture above, but this is only because the operands and the result of composition are being conflated.)

Only one of the two joining interfaces requires its languages to match (Problem 1).

This form of composition looks legal but is actually not (Problem 3)

Over the next couple of decades, several other diagrammatic systems were proposed, as pictured below. For example, Burkhardt proposed replacing Bratman’s T-pieces with X-shaped pieces, Sklansky et al. suggested D-shaped pieces, Rosin used pieces with a variety of shapes, and Earley and Sturgis extended Bratman’s T-pieces with I-pieces that represent interpreters. However, none of them satisfactorily resolved all three problems identified above.

A Burkhardt diagram of a compiler that accepts programs in language ‘N’, generates programs in language ‘SPS’, and is implemented in IBM 709 machine code.

A diagram by Sklansky et al. Step 1 shows an algorithm q written in language L (q/L) being compiled by an L-to-M compiler implemented in M (L/MM) on a machine M, to produce an implementation of q in M (q/MM). In Step 2, this implementation is executed on machine M with data δ, to produce the result q(δ).

A Rosin diagram. The input data is written in the ‘PI’ language; this is compiled to the ‘PO’ language by a compiler implemented in IBM 360 machine code. That compiler is running in a interpreter for 360 machine code that is implemented in 2065 machine code, which itself is running on an IBM 2065 machine.

An Earley–Sturgis diagram. The f function is written in Lisp, then translated to machine language (ML) by a compiler written in Lisp. That compiler is running inside a Lisp interpreter written in ML. The dashed line seems to be an attempt to clarify that the right-hand piece is the *result* of the composition (cf. Problem 2).

Our proposal: J-diagrams

We propose a redesign in which the basic piece has the shape of a backwards ‘J’.

This design makes it clear that there are two (and only two) modes of composition, as shown in the pictures below. Thus Problem 1 and Problem 3 are immediately addressed. To address Problem 2, we simply propose that J-pieces that are the result of the composition are not shown in the same diagram.

Left: Horizontal composition of J-pieces. Right: Diagonal composition of J-pieces.

We also propose two special cases of the J-piece that omit one interface. An I-piece omits the ‘target’ interface; it represents an interpreter, following Earley and Sturgis. A ‘tube’ omits the ‘implementation’ interface; it represents a compilation step that does not require an implementation, such as compiling Clight [4] to its superset, C. A ‘stopper’ omits both the ‘target’ and ‘implementation’ interface; it represents a machine that can directly execute programs in the ‘source’ language.

Left: An I-piece. Middle: A tube. Right: A stopper.

Next, we illustrate J-diagrams using several examples: Java compilation, verified C compilation using CompCert, a compiler testing framework, cross-compilation, and the history of the XPL programming language. In each case, we argue that the J-diagram is intuitive and clear, especially when compared to an equivalent T-diagram or ad-hoc diagram.

Java compilation as a J-diagram. A Java program is compiled to Java Bytecode using the javac compiler, which is written in Java. The JVM is an interpreter for Java Bytecode written in C++. The JVM itself is compiled using a C++ compiler such as gcc. Both javac and gcc must themselves be compiled somehow, but this diagram elects to leave those steps unspecified.

A J-diagram showing the high-level architecture of the CompCert compiler. The initial translation from C into the Clight sublanguage is not verified, but the rest of the compiler is written in Coq. The Coq tool is used to extract executable OCaml code.

A J-diagram showing how compilers written by students are assessed. Students submit a C-to-MIPS compiler implemented in C++, which is built using g++. The output of this compiler is tested by running it on the qemu MIPS emulator, which is implemented in C and built using gcc. In this diagram, we assume that g++ and gcc have both been compiled for an x86 machine.

Explaining cross-compilation using (top) an ad-hoc diagram from Wikipedia and (bottom) a J-diagram. The aim here is to use one machine (say, running Windows on an IA-32 processor) to build a compiler that can execute on another machine (say, running Mac OS on an x86_64 processor) and whose output that be executed on a third machine (say, running Android on an ARM processor). In both cases, the sequence is as follows. We first use the native Microsoft Visual C++ (msvc) compiler on the Windows machine to build the native g++ compiler for that machine. We then use this compiler to build a g++ compiler that targets the Mac. This compiler, in turn, is used to build the final compiler, which will execute on the Mac but will generate programs that execute on the Android device.

Explaining the history of the XPL language using (top) a T-diagram and (bottom) a J-diagram. The T-diagram, which is due to McKeeman, *is hard to understand* because of the large number of T-pieces and because of the ambiguity about which connections are meaningful. As a J-diagram, the history of the XPL language becomes much easier to read. The first XPL compiler (XCOM I) targeted IBM 360 machine code, and was written in ALGOL, which itself was compiled and executed on a Burroughs B5500 computer (far right). The second XPL compiler (XCOM II) was written in XPL, and was compiled using XCOM I, via a bootstrapping process. Similarly, the third version (XCOM III) was compiled using XCOM II. At the far left of the diagram, we see XPL being used to produce a compiler for a new user.

Interpreting J-diagrams

Our J-diagrams can be interpreted as vertex-labelled binary trees. As shown below, each J-piece represents a tree with three vertices, with each vertex labelled either with a language or with the distinguished ‘•’ symbol.

Interpreting J-pieces, I-pieces, tubes, and stoppers as vertex-labelled binary trees.

Composition of pieces corresponds to gluing trees together so that a leaf of the first tree has the same label as the root of the second tree. The picture provides an example of a tree obtained in this way.

The J-diagram from earlier in this blog post, interpreted as a tree.

We claim that interpreting a J-diagram as a tree in this way is natural. At the root of the tree is the ‘source’ language we wish to execute. At the leaves of the tree are all the languages that we need to be able to execute in order to be able to execute the language at the root of the tree. There is actually no particular need to distinguish between the ‘implementation’ language and the ‘target’ language – both are simply languages that we need to be able to execute. In other words, a compiler can be seen as a scheme for reducing the problem of executing programs in its ‘source’ language into two different problems:

the problem of executing programs in its ‘target’ language (the compiler’s output) and
the problem of executing programs in its ‘implementation’ language (the compiler itself). For example, the J-diagram above can be thought of as a method for executing C programs that depends upon a method for executing x86 programs.

Using J-diagrams to analyse compilers

We now suggest how J-diagrams could be used to aid the analysis and comparison of compilation strategies.

The picture below shows a graph of possible compilation strategies from LISP to JavaScript. This is the kind of graph generated by Akram’s tool for discovering all ways – whether direct or indirect – of compiling between any two given languages. Each vertex represents a language, and each edge represents a compiler from one language to another.

What is missing from this graph is information about how each compiler is implemented, and hence which further compilers may be needed in order to build them. To address this shortcoming, the picture below shows both possible compilation strategies as a pair of J-diagrams. By using this representation, it becomes clear that if we compile via Java Bytecode we shall also need the ability to execute Java code, and if we compile via LLVM IR we shall also need the ability to execute C++ code. Thus, more information is made available for the designer to make an informed choice.

Designing a compilation strategy from LISP to JavaScript with the help of J-diagrams. Left: LISP is compiled to Javascript via Java Bytecode using ABCL and TeaVM, both of which are implemented in Java. Right: LISP is compiled to Javascript via LLVM IR using Clasp and Emscripten, both implemented in C++.

If we have additional information about each J-piece beyond simply the languages involved, then we can use J-diagrams as an aid for reasoning about the performance of compilation strategies, as outlined next.

For any single J-piece, it is natural to ask:

How good is the generated output? Or, to coin a phrase, how well-targeted is the compiler? As a simple example, a compiler that inserts unnecessary sleep() instructions into its generated output is probably not well-targeted.
How long does it take to run? Or, to coin another phrase, how well-implemented is the compiler? As a simple example, a compiler whose source code includes unnecessary sleep() instructions is probably not well-implemented.

Then, for a J-diagram with source language S, it is natural to ask: if I use this compilation strategy to execute programs in language S, how quickly will I get the result?

A straightforward answer is that it depends upon how well-targeted and well-implemented all the compilers in the network are, since all of the compilers must be run before the final result can be obtained. For instance, in the J-diagram of the CompCert compiler above, in order to calculate the result of running the input C program, we need to run Coq to obtain an OCaml implementation of the verified compiler, then we need to run both the unverified compiler and the verified compiler. If the Coq compiler is well-targeted, then the verified compiler will be well-implemented.

A more nuanced answer involves distinguishing the time taken to obtain an executable (“compilation time”) from the time taken to run that executable (“running time”). In fact, there are several compilation times here: the time taken to compile the source program, the time taken to compile that compiler, the time taken to compile that compiler, and so on.

Traditionally, one is more concerned with improving the running time than the compilation time, since an executable can be compiled once and then run many times. In turn, the time taken to compile the executable tends to be more important than the time taken to compile the compiler, since a compiler can be compiled once and then used many times.

These various compilation times correspond to the different rows of a J-diagram. The running time depends upon how well-targeted the top row of J-pieces are, and how efficient is the machine or interpreter that runs this executable. The time taken to obtain this executable, on the other hand, depends upon how well-implemented the top row of J-pieces are, which in turn depends upon how well-targeted are the J-pieces (in the second row) that compile them. The time taken to obtain the compiler that obtains this executable depends upon how well-implemented the second row of J-pieces are, which in turn depends upon how well-targeted are the J-pieces in the third row, and so on.

For instance, the time taken to obtain the CompCert compiler depends upon how well-implemented the Coq compiler is. The time taken to use CompCert obtain an executable from a user-provided C program depends upon how well-implemented the unverified and verified compilers are, and how well-targeted the Coq compiler is. The running time of this executable depends upon how well-targeted the unverified and verified compilers are.

We can formalise this using the runningTime and compileTime functions defined below. As an example, we can see by unfolding those definitions that the ‘compile time’ for the J-diagram about student compilers depends upon the well–implementedness of the student’s compiler and the well–targetedness of g++, and that the ‘running time’ for that J-diagram depends upon the well–targetedness of the student’s compiler, the well–implementedness of qemu, and the well–targetedness of gcc, all of which is as expected.

How the running time and compile time of a program depend upon the well- implementedness and well-targetedness of the various compilers involved. Following a Prolog-like notation, we write ‘:−’ to mean ‘depends upon’.

Conclusion

We have investigated the foundations of diagrams that describe how compilers can be composed. To conclude, let us return to the four questions posed in the introduction.

First, we explained that existing graphical systems, chiefly T-diagrams, are unsatisfactory because the rules about how compilers can be composed are not intuitive; for instance, one form of diagonal composition is legal but the symmetric one is not.
Second, we presented a new visual language, based on ‘backwards J’-shaped pieces, that addressed the shortcomings of previous systems.
Third, in answer to the question of what these diagrams ‘mean’, we showed how they can be interpreted quite straightforwardly as binary trees. A J-diagram can thus be thought of as a strategy for problem-reduction: it describes how the problem of executing programs in the language at its root vertex can be reduced to the problems of executing programs in all the languages at its leaf vertices.
Fourth, in answer to the question of what these diagrams tell us about the compilers they depict, we suggested how they could be a useful aid when comparing different compilation strategies, or when analysing which parts of the compilation process are on the ‘critical path’ regarding compiling time or running time.

Ultimately, we expect J-diagrams will prove most valuable in teaching settings, because they are – we believe – an improvement upon the T-diagrams that are already in widespread use.

The authors would like to thank Diana Costa, Alastair Donaldson, Roser Pujadas, Will Venters, and Matt Windsor for helpful discussions.

http://johnwickerson.wordpress.com/?p=2078

Extensions

Uncovering multicore interference

John Wickerson Apr 20, 2020

Originally posted on Dan's Blog:
If I visit an exotic destination, I would surely do some sightseeing. I would cycle to that destination and I would use Google Maps for navigation. At most crossroads, I need to know in which direction I should turn. Other apps running on my phone might delay the indications…

Show full content

Dan's Blog

If I visit an exotic destination, I would surely do some sightseeing. I would cycle to that destination and I would use Google Maps for navigation. At most crossroads, I need to know in which direction I should turn. Other apps running on my phone might delay the indications provided by Google Maps and therefore I can mistakenly make a wrong turn and possibly be eaten by some large animal.

The reason why this is happening is that on all modern multicore processors, applications on different cores of the system are competing for the shared resources. This is not a problem in most applications but it can be in time-sensitive ones. If we design any such time-sensitive programs, we need to know what to expect in terms of interference so that we can accommodate for the delays. The most common approach for evaluating delays in multicore systems is to run…

View original post 406 more words

http://johnwickerson.wordpress.com/2020/04/20/uncovering-multicore-interference/

Extensions

Highlights from FPGA 2020

John Wickerson Feb 27, 2020

Here are a few personal highlights from the FPGA 2020 conference, which took place this week in Seaside, California. (Main photo credit: George Constantinides.) Jakub Szefer‘s invited talk on “Thermal and Voltage Side Channels and Attacks in Cloud FPGAs” described a rather nifty side-channel through which secrets could be leaked in the context of cloud-based… Continue reading Highlights from FPGA 2020 →

Show full content

Here are a few personal highlights from the FPGA 2020 conference, which took place this week in Seaside, California. (Main photo credit: George Constantinides.)

Jakub Szefer‘s invited talk on “Thermal and Voltage Side Channels and Attacks in Cloud FPGAs” described a rather nifty side-channel through which secrets could be leaked in the context of cloud-based FPGAs. The context is that one user of an FPGA would like to transmit information secretly to the next user of the same FPGA. Jakub suggests transmitting this information via the physical temperature of the FPGA. His proposal is that if the first user would like to transmit a ‘1’, they heat up the FPGA by running a circuit that uses a lot of current, such as a ring oscillator; if they would like to transmit a ‘0’, they don’t. The second user, once they arrive, measures the temperature of the FPGA. Jakub showed that even if there is a reasonable gap between the first user finishing and the second user starting, the measurement can still be taken fairly reliably because the heat is dissipated quite slowly. Follow-on work asks whether it is possible to transmit more information by just heating up (and then measuring the temperature of) parts of the FPGA.

Stefan Nikolic‘s paper on “Straight to the Point: Intra- and Inter-Cluster LUT Connections to Mitigate the Delay of Programmable Routing” (with Grace Zgheib and Paolo Ienne) struck me as a nice example of ‘out of the box’ thinking. FPGAs are designed to allow any two LUTs to be connected to one another, but because of their rigid structure, the route between them may be a little slow and circuitous. Stefan’s paper asks what would happen if we could just stick a direct wire between certain pairs of LUTs, like you can when designing fully-custom hardware. It’s a hypothetical question, of course, because FPGAs don’t actually allow this. But Stefan’s experiments using a simulator suggest that some significant performance improvements could be obtained if they did, so perhaps the FPGA manufacturers should bear this in mind for the future!

Philip Leong‘s presentation on “LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations” (on joint work with Seyedramin Rasoulinezhad, Siddhartha, Hao Zhou, Lingli Wang, and David Boland) was motivated by the observation that in several important applications, a lot of the LUTs on an FPGA are employed to implement XOR operations. His proposal is to modify the design of the FPGA itself to add dedicated XOR gates for handling these operations. The idea is that dedicated XOR gates can implement XOR operations rather more efficiently than general-purpose LUTs can. Moreover, by moving XOR operations out of the LUTs, those LUTs can be more usefully employed for other things. I found it rather interesting to note the connection between the work Philip described and some previous work from Imperial on LUTNet. Both take as their starting point the observation that a lot of LUTs are being used to implement XOR operations. But where Philip’s paper suggests moving these XORs to their own hardware, LUTNet suggests replacing the XOR operations with different ones that can still be implemented on a LUT but may be more suitable for that particular design, thus exploiting the flexibility of LUTs. Perhaps future work will see both solutions combined together.

Vinod Kathail from Xilinx gave a keynote about the new Vitis high-level synthesis (HLS) framework. Vitis HLS seems quite appealing to me, for two reasons. One is that it seems to expose rather more of the internal stages of the HLS process than Xilinx’s previous tools, which should make it easier for us to prototype research ideas for HLS. The other is that Vitis HLS promises not to need the plethora of ‘pragma’ annotations that current HLS tools expect programmers to provide. If it is really the case that Vitis HLS can manage to produce performant hardware without relying on these hints, then we may be finally heading towards a world where software programmers really can produce hardware designs without needing a degree in electrical engineering!

Sameh Attia‘s paper on “StateMover” (with Vaughn Betz) described a fascinating system for debugging hardware that sits halfway between simulating it and executing it. The advantage of simulation is its flexibility: the designer can inspect (or even change) the contents of every component, and they can run the clock forwards or backwards until they pinpoint the problem in their design. But the disadvantage of simulation is its slowness: simulating a design can take about 100000 times longer than executing it properly. Sameh’s work describes a way to move the entire state of a design back and forth between an FPGA and a software simulator. His proposal is that a design starts running on an FPGA, then at some predetermined trigger point, the clock is stopped and the current state of the FPGA is copied over to the simulator. The designer can then investigate what is going on at this point, and maybe step the clock forwards or backwards a bit, and then they can copy the new state back to the FPGA and carry on running it. The best of both worlds is thus achieved: the fiddly parts of the design can be run in the simulator, and the long, uninteresting parts can be left to the hardware. Myself, I wondered whether this system could form the basis of an exhaustive hardware simulator: one that runs a design in hardware until comes to some sort of ‘branch’ or ‘decision point’, at which point it saves its current state, and executes one of the paths. Later it restores this saved state and executes the other path. Thus, it might be feasible to explore all the possible behaviours of a system.

Lana Josipovic‘s Best-Paper-award-winning paper on “Buffer Placement and Sizing for High-Performance Dataflow Circuits” (with Shabnam Sheikhha, Andrea Guerrieri, Paolo Ienne, and Jordi Cortadella) was concerned with improving the performance of dynamically scheduled circuits, which can be thought of as a little like Petri nets. Inserting buffers into these circuits cannot change their functional behaviour, because every component executes when (and only when) all of its inputs are ready; adding buffers may mean that these inputs become ready a little later, but it cannot change their values. Nonetheless, there are a few reasons for inserting buffers into a circuit: to break up the critical path so as to increase the operating frequency, to break a combinational cycle, or to balance the latency of the tines in a control flow fork. Lana uses linear programming to work out when to place buffers in a circuit, where to place them, and how big to make each one; the result is an HLS tool that can generate considerably faster circuits.

Han Chen‘s paper on “FPGA-Accelerated Samplesort For Large Data Sets” (with Sergey Madaminov, Michael Ferdman, and Peter Milder) addresses one of the most long-studied problems in computer science: how can we sort data quickly? They describe a hardware-accelerated implementation of samplesort, which is a pleasingly simple sorting algorithm that I had not come across before. Samplesort is a little like mergesort, in that it involves partitioning the dataset into lots of parts that are all sorted independently. But where mergesort splits the dataset blindly, sorts each part, and finally merges the sorted subsets back together, samplesort first divides the dataset into several buckets (with each bucket covering a specific range of values), then sorts each bucket, and finally concatenates the buckets together. So both algorithms have an embarrassingly-parallel middle step, but mergesort has its complexity in the third step and samplesort has its complexity in the first step. Of course, the tricky part of samplesort is working out what ranges to use for each bucket so that each one ends up containing roughly the same number of data items. This is addressed by having a extra preprocessing stage, in which the data is sampled (hence the name of the algorithm) to estimate the appropriate range for each bucket. There are all sorts of interesting trade-offs in Han’s design. How many buckets should there be? (Presumably this depends on how much many buckets it is possible to sort in parallel.) How long should we spend sampling? (Time spent sampling is time not spent sorting, but if the sampling is inaccurate then the subsequent sorting will not be efficient.) How much contingency space should there be in each bucket? (If the buckets are too small, they may spill over, but if they are unnecessarily big, then hardware resources are being wasted.)

Finally, there were also two papers co-authored by me. Yann Herklotz‘s paper on “Finding and Understanding Bugs in FPGA Synthesis Tools” described his efforts to find bugs in tools like Quartus, Vivado, and Yosys; his presentation was greeted with an amusing mixture of alarm and pride from the developers of these tools, depending on the respective number of bugs he found in each! And Jianyi Cheng‘s paper on “Combining Dynamic and Static Scheduling in High-level Synthesis” (also with Lana Josipovic, George Constantinides, and Paolo Ienne) described his quest for a scheduling approach that combines the advantages of static (i.e. compile-time) scheduling and dynamic (i.e. run-time) scheduling. Both Yann and Jianyi did a great job with their presentations, making me a very proud supervisor!

http://johnwickerson.wordpress.com/?p=2070

Extensions

New names for odd and even functions

John Wickerson Feb 19, 2020

The concept of “odd” and “even” functions is quite important in areas such as Fourier analysis. An odd function is one that has 2-fold rotational symmetry around the origin; examples include , , and . An even function is one that has mirror symmetry around the y-axis; examples include and . The names “odd” and… Continue reading New names for odd and even functions →

Show full content

The concept of “odd” and “even” functions is quite important in areas such as Fourier analysis. An odd function is one that has 2-fold rotational symmetry around the origin; examples include $y=x$ , $y=x^3$ , and $y=\sin x$ .

An even function is one that has mirror symmetry around the y-axis; examples include $y=x^2$ and $y=\cos x$ .

The names “odd” and “even” come from the exponents in the examples above. For instance, $y=x^3$ is called an odd function because 3 is an odd number.

However, “odd” and “even” are actually not very good names for these functions. The thing is, there are several useful identities related to these functions, such as

whenever you add an odd function to another odd function you get an odd function

and

whenever you multiply an odd function by an even function you get an odd function

and almost none of these line up with the well-known identities for odd and even numbers. In mathematics, the whole point of using the same name for two different concepts is to tell the reader that much of the intuition from one concept can be carried over to the other. But this does not happen here. When you add an odd number to another odd number you get an even number, not an odd one. When you multiply an odd number by an even number you get an even number, not an odd one. Moreover, some functions are both odd and even (namely, the $y=0$ function), but no numbers are both odd and even. The analogy between odd/even functions and odd/even numbers simply does not work at all.

Odd/even functionsOdd/even numbersodd + odd = oddodd + odd = eveneven + odd = could be eithereven + odd = oddeven + even = eveneven + even = evenodd × odd = evenodd × odd = oddeven × odd = oddeven × odd = eveneven × even = eveneven × even = even

It would be better, I think, if odd functions were renamed as “negative” functions, and even functions were renamed as “positive” functions. These names aim to build an analogy to negative and positive numbers, and this time the analogy actually works. When you add a negative number to another negative number you get a negative number. When you multiply a negative number by a positive number you get a negative number. And the y=0 function corresponds naturally to the number 0 which is both positive and negative (or neither, if you prefer).

Negative/positive functionsNegative/positive numbersneg + neg = negneg + neg = negpos + neg = could be eitherpos + neg = could be eitherpos + pos = pospos + pos = pospos + pos = pospos + pos = pospos + pos = pospos + pos = pospos × pos = pospos × pos = pos

Of course, there are disadvantages to calling these functions “negative” and “positive”. For instance, calling a function “positive” might give the misleading impression that it lies entirely above the $y=0$ line. Still, no names are perfect, and I reckon that this disadvantage notwithstanding, my proposed names are better than the current ones.

http://johnwickerson.wordpress.com/?p=2058

Extensions

Fuzzing FPGA synthesis tools

John Wickerson Nov 22, 2019

When you want to do some computation on an FPGA, it is traditional to enter your design in a language like Verilog, and then to use automatic synthesis tools to turn your Verilog design into a “configuration bitstream” that can be fed to your FPGA to make it perform the computation you want. These synthesis… Continue reading Fuzzing FPGA synthesis tools →

Show full content

When you want to do some computation on an FPGA, it is traditional to enter your design in a language like Verilog, and then to use automatic synthesis tools to turn your Verilog design into a “configuration bitstream” that can be fed to your FPGA to make it perform the computation you want.

These synthesis tools occupy an extremely privileged position. They are, after all, responsible for almost all of the world’s FPGA-based computation. There is no feasible way to compute with an FPGA other than via these tools. They also occupy an extremely trusted position. We have to trust that they faithfully reproduce the computation we specified in Verilog, because we have no completely reliable way to check whether they have made a mistake. This is especially the case given that the details of how FPGAs are configured are generally not made public.

In our FPGA 2020 paper, my student Yann Herklotz and I ask whether these tools really deserve our trust.

To answer this, we set about doing some fuzzing. We were inspired by a tool called Csmith, whose sole purpose is to generate large, random C programs. These C programs don’t do anything in particular, but they are a nightmare for compiler-writers because they combine C language features in weird ways that humans wouldn’t ordinarily think of. Programs generated by Csmith have exposed hundreds of bugs in widely-used C compilers. We wanted to see if we could do the same thing for FPGA synthesis tools.

To this end, we produced a tool called Verismith that generates large, random Verilog programs. Like Csmith, we have to take care that the programs we generate are “legal”; after all, if you put bogus Verilog into a synthesis tool, you can’t expect sensible output. This means making sure that we don’t do things like drive a wire from two different sources, pass the wrong number of bits to a module, or divide by zero.

We used Verismith to generate about fifty thousand Verilog programs, and we gave each of these to four different FPGA synthesis tools: Xilinx XST, Xilinx Vivado, Intel Quartus, and Yosys. The output of these tools is a netlist – a list of interconnected low-level logic gates. (This netlist can then be turned into a configuration bitstream by other automatic tools that we have not tested.) To check whether the synthesis tool has worked correctly, we ask a SAT solver if there are any discrepancies between the input Verilog with the output netlist. If there are, then we have found a bug!

Each time we find a bug, we use Verismith to reduce it so that it is as small (and hence understandable) as possible. To do this, Verismith repeatedly chops the Verilog program in half, each time discarding the half that does not trigger the bug, until it cannot make it any smaller.

After reducing all the buggy programs to be as small as possible, we decided that some of these programs are really exposing the same bug. Ultimately, we found and reported 4 bugs in Yosys, 5 in Xilinx Vivado, and 1 in a Verilog simulator called Icarus. All of these bugs were confirmed by the vendors. (We didn’t report the bugs we found in Xilinx XST as this tool is no longer supported.)

Our programs trigger two different kinds of bugs in the synthesis tools. Some of them trigger “crash” bugs, which means that the synthesis tool goes wrong even though the input is perfectly valid. Others trigger “mis-synthesis” bugs, which means that the synthesis tool produces a netlist that is not equivalent to the input Verilog. In our paper, we dig into a handful of the bugs that we found, to explain exactly what the tools are doing wrong.

It’s hard to say for sure how much these bugs really matter. If they only come to light when the tools are given fairly baroque Verilog that no human would realistically write, and no Verilog-targeting compiler would automatically generate, then perhaps we need not be too concerned. That said, one observer on the Xilinx bugtracking forums did comment that one of the bugs we found “looks […] to be a rather critical bug”. Personally, I think any bug in a synthesis tool is enough to undermine its trustworthiness, whether or not it is likely to affect the computation that I want to do on my FPGA.

http://johnwickerson.wordpress.com/?p=1942

Extensions

A plot twist! Drawing better graphs in PL papers

John Wickerson Nov 1, 2019

In this post, I’d like to share some thoughts I’ve accumulated over the past few years about how to draw better graphs. To get straight to the point, I have two concrete recommendations: normalised data should usually be plotted on a logarithmic scale, and scatter plots can be easier to understand than bar charts. I’ll… Continue reading A plot twist! Drawing better graphs in PL papers →

Show full content

In this post, I’d like to share some thoughts I’ve accumulated over the past few years about how to draw better graphs.

To get straight to the point, I have two concrete recommendations:

normalised data should usually be plotted on a logarithmic scale, and
scatter plots can be easier to understand than bar charts.

I’ll now elaborate on both of these points, drawing upon examples from 31 examples of graphs I found in the proceedings of PLDI 2019.

Logarithmic scales for normalised data

I believe that normalised data should be plotted on a logarithmic scale. By “normalised data”, I mean data that is the ratio between two measurements that have the same dimension. For example: the ratio between the execution time of a program before a proposed compiler optimisation has been applied and the execution time of that program afterwards.

I will illustrate my reasons with reference to the two graphs below, which both show some sort of “speedup” that has been obtained on four benchmark programs, A, B, C, and D. The left graph uses a linear scale on the y-axis, while the right one plots the same data on a logarithmic scale.

There are four reasons why the logarithmic scale is better:

The natural origin for a speedup ratio is 1, not 0. That is, we are primarily interested in seeing whether a data point lies above 1 (which indicates a speedup) or below 1 (which indicates a slowdown). This fits nicely with logarithmic scales, which can’t go down to 0. In the right-hand graph above, it is immediately obvious that A and B experience a slowdown; this is slightly less obvious in the left-hand graph.
Going from a 1x speedup to a 2x speedup is surely more impressive than going from a 3x speedup to a 4x speedup. But on the linear y-axis in the left-hand graph above, the distance between 1 and 2 is the same as the distance between 3 and 4, so these feats would appear equally impressive.
Often, it is just as good to get a 2x speedup as it is bad to get a 2x slowdown. But on the linear y-axis in the left-hand graph above, the distance from 1 to 0.5 is much smaller than the distance from 1 to 2, so the speedup experienced by benchmark C is emphasised over the slowdown experienced by benchmark B, even though both have the same magnitude.
On a linear scale, the “centre of gravity” to which the eye is drawn lies at the arithmetic mean, while on a logarithmic scale, the centre of gravity is at the geometric mean. When averaging dimensionless ratios, most authors tend to use the geometric mean.

One caveat: the logarithmic scale doesn’t work well if the normalised data can be zero. This is quite rare though – I don’t think I saw this in any of the PLDI 2019 papers I scanned through when writing this post.

Here are some examples of graphs from PLDI 2019 that I believe would be improved by using a logarithmic scale on their y-axes.

This first graph demonstrates why y=0 is not the best origin – if the bars started at y=1 (and then stretched upward for speedups or downwards for slowdowns) then there would be no need for the thick black line at y=1.

This graph would also be clearer if its bars emanated from y=1. Moreover, this would serve to remove the unnecessary orange bars, which represent the values against which the other bars are normalised.

These two examples have the unfortunate property that some bars extend beyond the top of the scale. Another advantage of the logarithmic scale is that it handles unusually large (and unusually small) values gracefully.

This example is particularly unusual. It has the nice property of treating speedups and slowdowns symmetrically, but has a disconcerting discontinuity between y=-1 and y=1, and requires the reader to grapple with strange concepts like negative speedup factors! (From reading the surrounding text, I understand that a speedup factor of -2 should be interpreted as a 0.5x speedup, or a 2x slowdown.)

I’m a little torn about these two examples, both of which plot speedup against the degree of parallelism. In both cases, the advantage of plotting the speedup on a linear scale is that we can compare against the ideal situation of “speedup factor equals parallelism degree”, which is depicted as a straight line at y=x. Perhaps a reasonable compromise in these situations would be to use logarithmic scales for **both** axes.

Here is an assortment of other graphs from PLDI 2019 that plot normalised values on a linear scale.

I’d like to give an “honourable mention” to these two graphs, which do use logarithmic scales, but could be improved further by starting the bars at y=1 rather than at an arbitrary small number like y=0.01.

In praise of scatter plots

The problem with bar charts is that the x-axis tends to be underemployed. It has the capacity to represent some sort of quantity, but often is used simply to spread out a collection of benchmarks in some arbitrary order. I think that information can often be conveyed more effectively to the reader by using a scatter plot instead.

As an example, here is a classic bar chart, showing the performance of a “new” technique compared to an “old” technique over a range of benchmarks.

The graph is rather straightforward to read: when the green bar is higher, the old technique is faster, and when the red bar is higher, the new technique is faster. This graph provides all the information needed to compare the two techniques. But it is not easy to see “at a glance” which technique is better.

Here is an analogous scatter plot.

I reckon that scatter plots take longer to read than bar charts, but less time to understand. What I mean is: it’s immediately obvious from the bar chart that the height of a bar represents the time taken for that benchmark, but it takes a few moments to work out that in the scatter plot, the points above the diagonal represent benchmarks where the old technique is faster, and those below the diagonal represent benchmarks where the new technique is faster.

However, once this has been established, it becomes straightforward to compare the two techniques. One can immediately make observations like “the new technique seems to win on the shorter-running benchmarks, but to lose on the longer ones”.

It’s worth pointing out that my scatter plot shows less information than my bar chart, because it does not identify the individual benchmarks. Perhaps, if there are not too many benchmarks, it would be possible to label the points individually, or to use a different colour for each point. Of course, this risks overcomplicating things, and we are often more concerned with general trends than with the performance of particular benchmarks. A reasonable compromise might be to colour a handful of the particularly interesting points so that they can be referred to in the surrounding text.

By the way, I have used semi-transparent markers in my scatter plot. I find this quite an attractive way to deal with multiple points being almost or exactly on top of each other. With opaque markers, coincident points could get lost.

Scatter plots can cope with more adventurous situations too. For instance, here is a scatter plot that compares two variants of the “new” technique against the “old” one.

And here’s a scatter plot that conveys the uncertainty surrounding each data point using ellipses. The width of the ellipse corresponds to the uncertainty in the x-value, and the height of the ellipse corresponds to the uncertainty in the y-value. Ellipses that cross the y=x diagonal represent benchmarks where we’re not sure which technique is better.

Here are some examples of graphs from PLDI 2019 that might be better as scatter plots.