GeistHaus
log in · sign up

https://johnwickerson.wordpress.com/feed

rss
32 posts
Polling state
Status active
Last polled May 19, 2026 03:25 UTC
Next poll May 20, 2026 05:53 UTC
Poll interval 86400s
Last-Modified Fri, 08 May 2026 12:58:10 GMT

Posts

How do bitwidths actually work in Verilog?
AcademiaComputer ScienceFPGAscomputersprogrammingsciencetechnology
This post introduces a project that Gabriel Desfrene did in my group as an intern last summer. Gabriel will present his project in a paper coauthored with my PhD students Quentin Corradi and Michalis Pardalos and myself at CAV 2026 in July. Verilog (aka SystemVerilog) is the most widely used language for specifying, designing, and… Continue reading How do bitwidths actually work in Verilog? →
Show full content

This post introduces a project that Gabriel Desfrene did in my group as an intern last summer. Gabriel will present his project in a paper coauthored with my PhD students Quentin Corradi and Michalis Pardalos and myself at CAV 2026 in July.

Verilog (aka SystemVerilog) is the most widely used language for specifying, designing, and verifying all kinds of computer hardware. But some aspects of the language are widely misunderstood – even some as fundamental as “how many bits are used for each operation”.

Consider this example, which is due to Quentin:

logic [3:0] foo;
assign foo = (3'b110 + 3'b110 + 3'b110 + 3'b110) >> 2;

The first line is declaring foo to be a 4-bit variable. On the second line, 3'b110 is Verilog for “the 3-bit binary number 110”, and >> is the right-shift operation. That is, we are performing the addition of four copies of 110, then shifting to the right by 2, and finally storing the result in foo.

Verilog is a hardware description language: it describes pieces of hardware. This means that the addition written in Verilog has to map to a physical “adder” in the eventual hardware. And in order for that adder to come into being, it must have a width.

But what is the correct width?

  • Option 1. We could use a 3-bit adder – the same width as the operands. Then 110+110+110+110 gives 000 (it would have been 11000, but the highest two bits are lost to overflow). And after the right-shift we get 000. Thus, the final value of foo, after adding a leading zero to make a 4-bit value, is 0000.
  • Option 2. Perhaps we should use a wider adder? This would have the advantage of not losing bits to overflow, but the disadvantage of requiring more circuitry (which leads to hardware that ultimately consumes more electricity). If we don’t want to lose any bits to overflow here, we need at least a 5-bit adder. Then 110+110+110+110 gives 11000, and after the right-shift we get 00110. Thus, the final value of foo, after chopping off the leading zero to make a 4-bit value, is 0110.

Which of these options does the Verilog language standard say is the correct one? Well, it actually opts for Option 3: the width of the adder is, in fact, determined by the width of the variable being assigned to (i.e. foo) which is 4 bits in our example. So the addition 110+110+110+110 gives 1000 (losing just one bit to overflow this time), and after the right-shift we get 0010 as the final value for foo.

Examples like this are a well-known source of confusion among hardware engineers. For instance, a fairly well established Verilog tutorial from ChipVerify.com quite reasonably warns its readers that “[calculating the bitwidths of expressions] is subtle”.

It then goes on to get the explanation completely wrong. It incorrectly states that if a and b are 8-bit variables, then “assign product = a * b” will “lose the upper bits of the actual 16-bit product”, even if product has 16 bits. (In fact, this multiplication happens at 16 bits, that being the width of the left-hand side of the assignment, so no bits are lost to overflow.)

I asked Reddit’s resident Verilog experts about ChipVerify’s example. They unanimously (at the time of writing) described the tutorial as incorrect.

But they all gave different reasons as to why it is incorrect.

And none of those reasons were right.

Still, as Bill Vaughan famously wrote, “To err is human; to really foul things up requires a computer.” So, let’s see whether today’s AI models understand Verilog’s bitwidth rules any better than those humans do. Here’s what ChatGPT said when I asked it what value gets assigned to foo in my example:

Each operand is a 3-bit value (3'b110 = 6). In SystemVerilog, when you add fixed-width vectors, the result does not automatically grow—it stays the width of the operands unless you explicitly widen it. So each intermediate addition is truncated back to 3 bits [and] the full sum becomes 3'b000. […] Finally, assigning to logic [3:0] foo zero-extends the result: foo = 4'b0000 (decimal 0).

In other words, it’s going for Option 1. This is wrong.

Meanwhile, Meta’s Muse Spark model goes for Option 2, which is also wrong:

In SystemVerilog, when you add unsigned values, the result width is max(widths) + 1 for each addition. […] Result: foo = 4'b0110 = 6 decimal. Key gotcha: If you thought the 3-bit operands would truncate the sum at each step, they don’t. Verilog/SystemVerilog expands the width during expression evaluation. The truncation only happens on the final assignment to the 4-bit foo, but 6 fits fine.

Claude Opus 4.1 also goes for Option 2 to begin with, but then it changes its mind mid-response, and ends up going for the correct Option 3:

The value assigned to foo is 6 (binary 0110). […] Wait—that’s not quite right either. Let me reconsider the expression context. […] Hmm, but there’s more nuance—the 3'b110 literals are explicitly 3-bit, and the intermediate width depends on how operands combine. […] Since the final assignment target is 4-bit foo, the entire expression is context-determined to 4 bits. […] Final answer: foo = 4'b0010 (decimal 2).

Google’s Gemini is the only model I tried that confidently gets it right first time:

In SystemVerilog, the bit-width of an expression is determined by the “context-determined” rules. Since foo is a 4-bit logic type, the entire right-hand side is evaluated using 4-bit precision. […] The value assigned to foo is 4'b0010 (decimal 2).

Anyway, the long and short of it is this: Verilog’s rules for bitwidths are confusing.

To some extent, the rules are unavoidably complicated, but a large part of the problem is that they are specified in informal prose that is spread over several parts of the (very large) Verilog standard.

So, aiming to bring as much clarity as possible to the situation, we decided to formalise Verilog’s rules for bitwidths in the Rocq theorem prover. The murky details are in our paper and in Gabriel’s Rocq development, but in this blog post I’ll just explain how things work roughly, via a few examples.

The calculation happens in two stages. These stages are called “determine” and “propagate”. In the “determine” phase, bitwidth information flows up the expression tree, calculating the “self-determined width” of every sub-expression. Then, in the “propagate” phase, bitwidth information flows back down the expression tree, fixing the “final width” of every sub-expression.

Here’s how it works on our example. We start with the expression we want to analyse, drawn as a syntax tree:

The addition is labelled as just a “BinOp” (binary operation) because all arithmetic operations (additions, subtractions, multiplications, and so on) are handled in the same way. The only node that has a bitwidth already is foo, and that bitwidth is obtained from foo‘s declaration.

The “determine” phase sends bitwidth information up this tree, as shown by the red arrows in the following picture:

Working from the bottom, we see that the constants all have a self-determined width of 3, and this fixes the self-determined width of the BinOp at 3 too (3 being the max of 3, 3, 3, and 3). Thus the ShiftOp also has a self-determined width of 3 (it ignores the width of its right-hand operand). The width of the top-level assignment operation is determined only by its left-hand operand, foo, which has a width of 4.

Then, the “propagate” phase sends bitwidth information back down the tree. In the following picture, we write “ShiftOp : 3 → 4” to mean that the self-determined width of the ShiftOp is 3, but its final width is 4.

Thus, we can see that the final width of the adder (the BinOp in the middle of the picture above) is 4 bits.

For the example from ChipVerify’s tutorial, the “determine” phase sets the self-determined width of the multiplication to be 8 bits, but the “propagate” phase sets its final width to 16 bits because this is the width of the left-hand side of the assignment.

I’ll do one last example, as there’s one more subtlety that is worth mentioning. Consider another little Verilog design:

logic [3:0] baz;
assign baz = (3'b110 + {2'b10, 1'b0}) >> 2;

It’s similar to our first example, but it includes a concatenation operation, in which the 2 bits of 2'b11 are placed alongside the single bit of 1'b0 to form the 3-bit value 3'b100. Drawn as a syntax tree, we have:

After running the “determine” phase, we get:

Note that the self-determined width of the concatenation operation is the sum (+) of the widths of its operands. The rest is similar to the previous example.

The difference comes when we run the “propagate” phase:

Note that the width of 4 has propagated from baz, setting the final width of the shift, the BinOp, the constant 3'b110, and the concatenation. But it has not propagated below the concatenation. The concatenation operation can thus be seen as a “boundary” to propagation. In the terminology of the Verilog standard, each operands of a concatenation operation has its bitwidth “solely determined by the expression itself” and not “by the fact that it is part of another expression”.

So there we go; that’s more-or-less how bitwidth calculations really work in Verilog. Our paper has more of the details, and more still are available in Gabriel’s Rocq development. We imagine that our work can be valuable in a couple of ways. Firstly, we expect that it can strengthen the guarantees provided by tools like Lustig and Vera, which can reason formally about the correctness about Verilog designs but currently have to assume that all bitwidths have been written out explicitly. Secondly, we hope that our work can help the maintainers of the Verilog standard come up with a way to explain bitwidth calculations with more clarity and precision, so as to minimise confusion in the future.

Screenshot 2026-05-06 at 14.47.08
wicko3
http://johnwickerson.wordpress.com/?p=2776
Extensions
Finding bugs in FPGA place-and-route tools
AcademiaComputer ScienceFPGAselectronicstechnology
This blog post is about Ollie Cosgrove’s paper (with Ally Donaldson and me) that will be presented at the 34th International Symposium on FPGAs in February. Over the last few years, I’ve enjoyed being a part of several projects that have been giving hardware design tools a hard time; that is: developing techniques for uncovering… Continue reading Finding bugs in FPGA place-and-route tools →
Show full content

This blog post is about Ollie Cosgrove’s paper (with Ally Donaldson and me) that will be presented at the 34th International Symposium on FPGAs in February.

Over the last few years, I’ve enjoyed being a part of several projects that have been giving hardware design tools a hard time; that is: developing techniques for uncovering bugs in the tools that hardware designers use to design hardware. We’ve done testing campaigns on logic synthesis tools, on high-level synthesis tools, on equivalence checkers, and on Verilog parsers – all of which have found correctness bugs in widely used commercial tools.

In this work, we’ve looked at another important tool of hardware design: FPGA place-and-route (P&R) engines. A P&R engine takes as input a netlist, which is a low-level description of a hardware design that consists of a collection of components (memories, registers, adders, logic gates, and so on) and how they are to be wired together. The job of the P&R engine is to work out how best to assign those components to the FPGA’s available computational resources (that’s the “placement” part) and how those resources should be connected (that’s the “routing” part). A P&R engine is allowed to modify the netlist if that enables a better outcome – as long as doing so doesn’t change the design’s behaviour, of course! For instance, it might decide to replicate a component in order to avoid the need for really long wires.

Ollie built a tool, called FUZNET, that generates thousands of random (but valid) netlists, fed them into AMD’s Vivado P&R engine, and checked in each case whether the pre- and post-P&R netlists are equivalent. He was able to find two bugs in Vivado 2024.2, one of which has been received by AMD but not yet confirmed, and one of which has been confirmed and fixed for the 2025.2 version. I’ll explain this second bug shortly, but first I’ll make a few remarks about FUZNET itself.

Why fuzzing P&R engines is challenging

In some ways, this was just another fuzzing project: generating lots of random inputs, feeding them to a system-under-test, and investigating any unexpected outputs. But P&R turned out to be a surprisingly interesting domain for fuzzing, for five reasons:

  1. Fuzzing is effective at uncovering bugs not because its randomly generated test cases are particularly high-quality, but because it churns through such a huge number of them, just like monkeys wielding typewriters. In other words, if fuzzing is going to work, we need to run test cases at a fast rate. This is a challenge for P&R engines, which take a long time to run! Hence, Ollie had to try to make each test case as high-quality as possible.
  2. Usually, fuzzing involves generating very large test cases, since these are likely to be challenging for the system-under-test to handle, and hence most likely to trigger bugs. But bigger is not necessarily better in a P&R context, because a P&R engine is targeting a device with a fixed set of resources. If a given test case has no hope of fitting onto the target device, the P&R engine will simply give up. Hence, Ollie had to try to generate test cases that are large enough to challenge the P&R engine, but small enough to fit on the target device.
  3. When the P&R engine does produce a netlist, we need a way to check whether it is equivalent to the netlist it was given. Even this seemingly simple task is tricky, because no single approach offers a silver bullet: simulating the pre- and post-P&R netlists with a bunch of random input vectors is not exhaustive, but trying to prove the pre- and post-P&R netlists logically equivalent using automated solving can be intolerably slow! Hence, Ollie had to come up with a clever combination of equivalence-checking techniques that would work well in the common case.
  4. If fuzzing ever does stumble upon a bug-triggering test case, we need to perform “test-case reduction”: a trial-and-error process for finding a minimal version of the test case that still triggers the bug. Most reduction algorithms involve churning through a huge number of trials, which is fine when the system-under-test produces each output quickly, but not great when testing a slow-running P&R engine! Hence, Ollie had to devise a new reduction algorithm that converges quickly.
  5. The other problem with test-case reduction arises from how the P&R engine works. Conventional software compilers produce output that is “as good as possible”, but most tools in hardware design are designed to produce output that is “as good as it needs to be”. That is: the hardware designer gives the P&R engine a set of constraints (about things like circuit area and clock frequency), and once those have been met, the P&R engine won’t bother trying to optimise any further. This complicates test-case reduction, because if removing a chunk from a failing test-case leads to it now passing, we don’t know if that’s because that particular chunk was the crux of the bug, or simply because the P&R engine didn’t need to invoke so many optimisations on the smaller test-case! Hence, Ollie had to make sure that when reducing a test-case, he also upped the constraints by the right amount so that the same optimisations would still be performed.
Explaining the bug

Here’s a picture of the confirmed-and-fixed bug in AMD Vivado’s P&R engine.

The netlist at the top is the one given to the P&R engine; the P&R engine changes it to the one at the bottom, but the two are not quite equivalent. The explanation is a little involved, but I’ll try. The brown “CFGLUT5” component is a “5-input dynamically reconfigurable look-up table”, which means that it is essentially a table with 32 rows, and it uses five inputs, I4I0, to select which of those rows should be output (on O5 and O6). In the pre-P&R netlist, the input vector is (0,0,0,0,1), which selects row 1. However, Vivado has noticed that the table happens to be initialised so that rows 0 and 1 contain the same data, so it chooses to transform the circuit to use a simpler input vector, (0,0,0,0,0), which selects row 0.

However, this reasoning fails to account for the “dynamically reconfigurable” nature of the CFGLUT5. Normal look-up tables have their entries fixed at initialisation, but those in a CFGLUT5 can be changed at runtime. This is done by setting the CE signal and then feeding the new entries serially through the CDI port. In the pre-P&R netlist, the CE signal is constantly set, which means that the CFGLUT5 is continually being reconfigured, using whatever data comes in on input _02_ (top left). The behaviour of the CFGLUT5 component is thus completely unknowable to the P&R engine, and as such, the P&R engine cannot safely make any changes to the CFGLUT5’s inputs. Thus, in particular, changing the input vector from (0,0,0,0,1) to (0,0,0,0,0) is not sound.


In conclusion: once again, fuzzing turns out to be effective at finding subtle bugs in mature and widely used tools. There are plenty of ways FUZNET could be extended to find more bugs, e.g. by incorporating more advanced components into the netlists it generates. It would also be nice to expand the class of P&R bugs that FUZNET is capable of finding: currently, it only checks the changes that the P&R engine has made to the netlist, but there could also be bugs in the actual placement and routing that the engine comes up with – devising techniques that are capable of uncovering these is a job for future work.

Screenshot 2025-12-24 at 11.40.24
wicko3
http://johnwickerson.wordpress.com/?p=2757
Extensions
Are conference reviewers harsher when they have a submission of their own?
AcademiaComputer Scienceresearch
My area of academia runs mainly on conferences, as opposed to journals. This means that a few times each year, hundreds of researchers simultaneously submit papers about their latest and greatest project, and a “program committee” decides which of those get to appear in the proceedings of the upcoming conference. There is usually considerable overlap… Continue reading Are conference reviewers harsher when they have a submission of their own? →
Show full content

My area of academia runs mainly on conferences, as opposed to journals. This means that a few times each year, hundreds of researchers simultaneously submit papers about their latest and greatest project, and a “program committee” decides which of those get to appear in the proceedings of the upcoming conference.

There is usually considerable overlap between the members of that program committee and the researchers who submit papers. This is partly because if conferences didn’t allow the members of their program committee also to submit their own papers, few would be willing to serve on that committee. (But of course, nobody reviews their own submission!)

It seems to me that if a member of the program committee has a submission of their own under review, and wishes to act purely selfishly, then they should be as negative as possible about all the papers they are reviewing. Thus, the argument goes, they should increase the chances of their own paper ending up near the top of the rankings and ultimately being accepted.

To what extent does this actually happen?

I accessed some data for a recent computer science conference. I partitioned the 73 members of the program committee into the 47 who had no submission of their own and the 26 who had at least one. (Only a handful of reviewers had more than one submission, so I lumped them all together.)

I also looked at the average scores given by all those reviewers. Each reviewer reviewed between one and five papers, and scored each one on a 4-point scale (strong reject, weak reject, weak accept, strong accept). So I took each reviewer’s “mean review score” by a weighted average (strong reject = -1.5, weak reject = -0.5, weak accept = 0.5, strong accept = 1.5).

I found that the mean review scores given by the two groups are as follows:

No paper under review-0.117At least one paper under review-0.478

Both numbers are negative, which means both groups of reviewers lean towards rejecting rather than accepting – what grumps! But those that have a paper of their own under review seem noticeably more negative than those that don’t.

Here’s a graphical version, which shows a bit more information:

The graph shows the distribution of mean review scores for the two groups (blue line = no submission, green line = at least one submission). The dots are the actual data points after dividing the review scores into buckets of width 0.2; the lines are moving averages (window = 2) which smoothes out the noise a bit. We see that both distributions are vaguely Gaussian, peaking a bit below a score of 0, but that the green line is noticeably skewed to the left (more negative scores).

I also calculated the Mann–Whitney U-test on this data, which promises to quantify whether this apparent skew is actually statistically significant. The number crunching says yes, there is a statistically significant effect here: there’s only about a 0.4% chance that this skew is simply caused by random noise.

So that’s quite interesting. Reviewers do seem to be harsher when they have a paper of their own under submission.

Of course, it’s perfectly possible that having a paper under submission doesn’t actually cause a reviewer to be harsher, but rather, that there is a third factor causing both of these things to be simply correlated.

One candidate factor is the seniority of the reviewer. Senior professors might give lower review scores because they’re more familiar with the research area and hence find it easier to spot missing citations, etc.; they may also be more likely to have papers under submission from their large research group. Or it might go the other way: senior professors might give higher scores because they gloss over the little flaws that junior folk pick up on, and they may be less likely to have papers under submission because they’re not focusing so urgently on building up their publication profile.

But no, it would seem not. Here’s a graph that shows the relationship between how many publications a reviewer has co-authored (according to their DBLP profile) and their review scores.

Visually, there seems to be no relationship, and sure enough, Spearman’s coefficient of rank correlation confirms that there is no statistically significant correlation here.

The same applies if we use “number of years since first publication” as another proxy measure for seniority, again using data from DBLP:

Again, there’s no visual correlation here, and Spearman agrees. So seniority doesn’t appear to be a confounding factor here.

Another factor to consider is whether the reviewer is affiliated with a University or not. Perhaps reviewers who are affiliated with a company or a government organisation are not so focused on growing their publication count, and hence may feel less “competitive” when reviewing.

My data says that yes, this does indeed seem to be an important factor. The graph below plots the scores given by reviewers currently affiliated with a University (blue line) and reviewers not currently affiliated with a University (green line).

It looks like the University-affiliated reviewers are harsher (mean=-0.329) than their non-University-affiliated counterparts (mean=0.027). The Mann–Whitney U-test reports a 4% chance that this is just down to random noise, so there’s just about a statistically significant effect here.

What’s certainly the case is that the non-University-affiliated reviewers are not submitting many papers: of the 17 reviewers, only one of them had a submission of their own, whereas the 56 University-affiliated reviewers had an average of 0.64 submissions.

One way to proceed here would simply be to discount all the non-University-affiliated reviewers. Here’s the graph obtained from doing that:

We see that the University-affiliated reviewers with a submission of their own (green line) are still noticeably harsher than those without (blue line). And the Mann–Whitney U-test confirms that there’s still a statistically significant effect going on here: there’s only a 2% chance that this is just down to random noise.

So, what to do about this?

One could imagine attaching a little “badge” to reviewers’ names in the discussion forums that indicates “I have at least one paper under submission myself”. That might discourage them from being overly negative about a paper. Then again, reviewers might overcompensate and become too positive in their reviews, for fear of being accused of acting in a self-interested manner!

A more subtle approach could be to make this information visible to program chairs while papers are being allocated to reviewers, so that they can try to ensure that each submission has a balanced number of reviewers from both groups.

wicko3
http://johnwickerson.wordpress.com/?p=2725
Extensions
Fuzzing Quantum Compilers
AcademiaComputer ArchitectureComputer ScienceProgramming Languagesquantum-computingtechnology
Ilan Iwumbwe and Benny Liu did undergraduate research placements with me this summer, and I’m very pleased that they will be presenting their work at the Programming Languages for Quantum Computing (PLanQC) workshop at POPL next month. Ilan and Benny built a tool called QuteFuzz that randomly generates descriptions of quantum circuits. The idea is… Continue reading Fuzzing Quantum Compilers →
Show full content

Ilan Iwumbwe and Benny Liu did undergraduate research placements with me this summer, and I’m very pleased that they will be presenting their work at the Programming Languages for Quantum Computing (PLanQC) workshop at POPL next month.

Ilan and Benny built a tool called QuteFuzz that randomly generates descriptions of quantum circuits. The idea is to use these randomly generated circuits to fuzz-test quantum compilers, such as Quantinuum’s Pytket, IBM’s Qiskit, and Google’s Cirq. We want to check that they don’t crash and that any optimisations they do on the quantum circuits don’t change their behaviour.

QuteFuzz is not the first random generator of quantum circuits – there exist a handful of tools that do this already, such as QDiff and MorphQ – but it does aim to test a couple of language features that prior tools don’t reach. Specifically, QuteFuzz focuses on producing random quantum circuits that contain subcircuits and control flow.

  • Subcircuits are analogous to subroutines in ordinary programming: they allow a piece of computation to be defined once and then instantiated many times.
  • Control flow refers to quantum circuits where the qubits can pass through different sequences of quantum gates depending on the outcome of a measurement.

Below is an example of a quantum circuit generated by QuteFuzz. I won’t explain it line-by-line, but I’ve labelled the part that defines a subcircuit and the parts that define if-statements for control flow.

Here is a graphical representation of that same circuit. We see the subcircuit (circuit-163), the one-armed if-statement (If-0 to End-0), and the two-armed if-statement (If-1 to Else-1 to End-1).

The natural way to deploy these randomly generated quantum circuits is to use differential testing: checking that the circuits behave the same way if compiled by two different compilers, or at two different optimisation levels. The problem with this approach is that two circuits producing different final qubit values doesn’t necessarily indicate a compiler bug – it could just be the inherent non-determinism of quantum computing. One workaround for this is to simulate the circuits hundreds of times, until we are reasonably confident (via the Kolmogorov–Smirnov test) that the two circuits are using the same probability distribution of final qubit values.

For example, the graph below shows the final qubit values observed from 1024 simulation runs of that randomly generated circuit when compiled using Qiskit with no optimizations. The output is “16” on 790 of the runs, and “48” on all the rest.

The next graph shows the result of simulating that same circuit having compiled it with the same compiler at optimisation level 3. It is immediately obvious that the qubit values are following a completely different distribution. This turned out to be the result of an invalid optimisation being performed by the Qiskit compiler.

All in all, QuteFuzz found seventeen bugs in the quantum compilers that Benny and Ilan tested – some miscompilations, some crashes, and some in the simulators that were used to produce graphs like the ones above. All but two of those bugs have since been confirmed by the developers, and three have already had fixes deployed.

For more information about QuteFuzz, check out Benny and Ilan’s Github repo and their paper.

qutefuzz
wicko3
http://johnwickerson.wordpress.com/?p=2653
Extensions
The Hoare Cube
AcademiaComputer ScienceProgramming LanguagesSeparation LogicMathematics
I wrote earlier this year about my attempt to understand the repercussions of toggling and when giving a semantics to Hoare triples. In response to that post, Yann Herklotz helpfully pointed me to Patrick Cousot’s POPL 2024 paper [10], which does all this and a lot more besides. In particular, Cousot’s paper includes the following… Continue reading The Hoare Cube →
Show full content

I wrote earlier this year about my attempt to understand the repercussions of toggling \subseteq and \supseteq when giving a semantics to Hoare triples. In response to that post, Yann Herklotz helpfully pointed me to Patrick Cousot’s POPL 2024 paper [10], which does all this and a lot more besides. In particular, Cousot’s paper includes the following diagram…

…which shows how to obtain dozens of program logics just by toggling a handful of switches in the semantics.

However, that diagram is completely terrifying, and (just between you and me) I can’t get my head around much of the rest of Cousot’s paper. So, this blog post is my attempt to end up with a similar diagram to Cousot’s one, but while trying to keep things as simple as possible.


Let us assume that we have a set of states, and that a command is simply a binary relation on states. We shall write (\sigma, \sigma') \in c to mean that if command c is started in state \sigma it can produce final state \sigma'. Our commands need not always terminate: if there is no \sigma' such that (\sigma, \sigma') \in c, then that means c does not terminate when started in state \sigma. Also, our commands can be non-deterministic: if (\sigma, \sigma') \in c and (\sigma, \sigma'') \in c with \sigma' \neq \sigma'', then that means c behaves non-deterministically when started in state \sigma.

The original “partial correctness” interpretation [1] of the Hoare triple \{P\}\,c\,\{Q\} is that whenever c is started in a state satisfying the assertion P (an assertion is just a set of states), any final state it reaches must satisfy the assertion Q. That is…

\forall \sigma, \sigma'\ldotp \sigma \in P \wedge (\sigma,\sigma') \in c \longrightarrow \sigma' \in Q.

Let us define a helper function: the “strongest postcondition”. We shall write \text{sp}(P,c) for the set of all states that command c can reach if it starts executing in a P state…

\text{sp}(P, c) = \{\sigma'\ldotp \exists\sigma\ldotp (\sigma,\sigma') \in c \wedge \sigma \in P\}.

We can use \text{sp} to give a semantics to Hoare triples in a variety of different ways. For instance, the partial correctness interpretation given above can be rewritten as…

\text{sp}(P,c) \subseteq Q.

Flipping the \subseteq into a \supseteq to get Incorrectness logic…

\text{sp}(P,c) \supseteq Q

…is what led O’Hearn to remark that “program correctness and incorrectness are two sides of the same coin”. But there are other components that can be “flipped” too. I think there are three “dimensions” in total:

  • Reach: we can flip from a may semantics (“the states c can reach if…”) to a must semantics (“the states c can only reach if…”)
  • Direction: we can use c as-is (a forward semantics), or we can invert it to c^{-1} (a backward semantics).
  • Approximation: we can flip between \subseteq (where the command’s behaviour is “over approximated”) and \supseteq (where the command’s behaviour is “under approximated”).

(NB: Cousot seems to identify four dimensions in his diagram above, but I haven’t understood what his fourth one is.)

The eight options, then, are as follows:

Name(s)Meaning of \{P\}\,c\,\{Q\}Reach
Direction
Approximation
Partial correctness [1]\text{sp}(P,c) \subseteq Qmay
forward
overReverse Hoare logic [5],
Incorrectness logic [6]
\text{sp}(P,c) \supseteq Qmay
forward
underNecessary preconditions (as formulated in [4])\text{sp}(Q,c^{-1}) \subseteq Pmay
backward
overBackward under-approximate Hoare triple [2],
Lisbon triple [3],
Sufficient incorrectness logic [4],
Possible correctness [9]\text{sp}(Q,c^{-1}) \supseteq Pmay
backward
underNecessary preconditions (as originally formulated in [7])\text{sp}(\neg P,c) \subseteq \neg Qmust
forward
over\text{sp}(\neg P,c) \supseteq \neg Qmust
forward
underPartial correctness (alternative formulation)\text{sp}(\neg Q,c^{-1}) \subseteq \neg Pmust
backward
over\text{sp}(\neg Q,c^{-1}) \supseteq \neg Pmust
backward
under

Note that partial correctness can also be rewritten into the more familiar P \subseteq \text{wlp}(c, Q) if we define \text{wlp} as the “weakest liberal precondition” in the usual way…

\text{wlp}(c,Q) = \{\sigma\ldotp \forall\sigma'\ldotp (\sigma,\sigma') \in c \longrightarrow \sigma' \in Q\}

…but I quite like how “systematic” the table above is: it distinguishes may from must by negating the pre and postconditions; it distinguishes forward from backward by using c or c^{-1}; and it distinguishes over from under approximate by using \subseteq or \supseteq.

Note also that I’ve labelled \text{sp}(\neg P,c) \subseteq \neg Q as the “original” formulation of necessary preconditions, as it captures the two negatives in Cousot et al.’s definition quite nicely: “under which precondition, if violated, will the program always be incorrect?” [7, my emphasis].


We can assemble all eight options in the table into a cube — a “Hoare cube”, if you will.

In the cube, I’ve drawn solid lines between vertices that represent equal semantics. For instance, may+forward+over is equal to must+backward+over because…

\text{sp}(P,c) \subseteq Q \quad\Longleftrightarrow\quad \text{sp}(\neg Q,c^{-1}) \subseteq \neg P.

Sometimes these equalities can be phrased as Galois connections. For instance…

\text{sp}(P,c) \subseteq Q \quad\Longleftrightarrow\quad P \subseteq \text{wlp}(c, Q)

…signifies a Galois connection between \text{sp}(-,c) and \text{wlp}(c, -).

I’ve also drawn dashed lines between vertices that represent equivalent semantics. By “equivalent” here, I mean that there exists a bijection f between triples such that the triple \{P\}\,c\,\{Q\} holds according to one semantics if and only if the triple f(\{P\}\,c\,\{Q\}) holds according to the other semantics. What this means is that one strategy for proving \{P\}\,c\,\{Q\} under one semantics is to prove f(\{P\}\,c\,\{Q\}) under the other semantics.

For instance, may+backward+over (necessary preconditions, \text{NC}) is equivalent to may+forward+over (partial correctness, \text{PC}) because…

\text{NC} \models \{P\}\,c\,\{Q\} \quad\Longleftrightarrow\quad \text{PC} \models \{\neg P\}\,c\,\{\neg Q\}

…as Ascari et al. have observed [4, Proposition 5.4]. Here we use the bijection f_1 that maps \{P\}\,c\,\{Q\} to \{\neg P\}\,c\,\{\neg Q\}.

For another instance, may+forward+under (incorrectness logic, \text{IL}) is equivalent to may+backward+under (sufficient incorrectness logic, \text{SIL}) because, using another bijection f_2 that maps \{P\}\,c\,\{Q\} to \{Q\}\,c^{-1}\,\{P\}, we have…

\text{IL} \models \{P\}\,c\,\{Q\} \quad\Longleftrightarrow\quad \text{SIL} \models \{Q\}\,c^{-1}\,\{P\}.

For this reason, I’ve deemed all of the over nodes equivalent (and coloured them green) and all of the under nodes equivalent (and coloured them blue).

But is it a bit of a stretch to consider all those same-coloured nodes “equivalent”? I mean, if the point of these bijections is to suggest how a proof in one logic can be transformed into a proof in another logic, then are these transformations reasonable? The bijection f_2, for instance, requires the ability to invert an arbitrary command (turning c into c^{-1}). That’s straightforward enough for simple programs, thanks to properties like (c; d)^{-1} = (d^{-1} ; c^{-1}) and (c + d)^{-1} = (c^{-1} + d^{-1}) and (c^*)^{-1} = (c^{-1})^*, but it could become infeasible in more complicated settings. Likewise, the bijection f_1 requires the ability to negate an arbitrary assertion (turning P into \neg P), which is fine in ordinary predicate logic but doesn’t work very nicely in a substructural logic like separation logic.

Still, it does give a pretty handy way to generate proof rules for one logic from the proof rules for another logic. For instance, Ascari et al. [4] present this proof rule from incorrectness logic…

…and if we apply our bijection f_2 to all the triples in this rule, then we obtain “for free” the corresponding proof rule from their sufficient incorrectness logic…

…without having to prove anything about its soundness.

To conclude: perhaps program correctness and incorrectness are not simply two sides of the same coin, but two opposite faces of the same cube!

Related work

Möller et al. [2, Section 5] define and briefly study “backward under-approximate Hoare triples”. Zilberstein et al. [3] refer to those triples also as “Lisbon triples”. Ascari et al. [4] define sufficient incorrectness logic as a proof system for those “backward under-approximate Hoare triples”, and I’m grateful to Ike Mulder who referred me to Ascari et al.’s paper in a comment on my earlier blog post.

Zilberstein et al. [3] defines and studies “Outcome logic”, but I haven’t included it in my “cube” because I don’t think it quite fits with the other logics – its postconditions are not simply “sets of states”, but have a more complicated monoidal structure. I also haven’t included “Exact separation logic” [11], which unifies over-approximate and under-approximate reasoning in the context of separation logic.

Zhang and Benjamin Kaminski [8, Section 6.3] also studied the eight options in my table above, similarly obtained by toggling three binary switches (albeit with slightly different names), and I’m grateful to Benjamin for pointing me to his work in a comment on my earlier blog post. I’ve reproduced their table here:

The two blue rows in their table are equal, and they correspond to the two “partial correctness” vertices in my cube. The two orange lines are also equal, and they correspond to the two “necessary preconditions” vertices in my cube. I also haven’t got names for the ??? and ¿¿¿ logics, but I note that ??? can be straightforwardly obtained from sufficient incorrectness logic via the f_1 bijection I defined above, and ¿¿¿ can be obtained from incorrectness logic in the same way, so I think this makes it unlikely that those logics will offer anything particularly new.

For the record: one potential point of confusion in Zhang and Kaminski’s table is that what they call “wp” is actually what O’Hearn [6, Section 6.3] calls the “weakest possible precondition”: the sets of states from which c can reach a final state satisfying the postcondition (but due to non-determinism, might not). Thus, I think “total correctness” should probably be read as “possible correctness”, following Hoare’s terminology [9, Section 5.3].

Bibliography
Screenshot 2024-12-16 at 11.22.46
wicko3
http://johnwickerson.wordpress.com/?p=2571
Extensions
Fun with income tax
Miscellaneous
The UK government imposes a tax on people’s income, and, as is quite conventional, the rate at which a person pays this tax increases as their income increases. However, I was surprised to notice recently that this rate does not increase monotonically with income. Here is how the UK government explains the calculation of income… Continue reading Fun with income tax →
Show full content

The UK government imposes a tax on people’s income, and, as is quite conventional, the rate at which a person pays this tax increases as their income increases.

However, I was surprised to notice recently that this rate does not increase monotonically with income.

Here is how the UK government explains the calculation of income tax:

So far, so simple. You pay no tax on the first £12,570 of your income, then you pay 20% of the next £37,700, then 40% of the next £74,870, and finally 45% of whatever’s left.

However, there is an additional complexity not captured by the table above. Once your income hits £100,000, your Personal Allowance (the portion of your income on which you pay no tax) starts to go down. It reduces by £1 for every £2 your income is over £100,000, eventually stopping at £0 once your income hits £125,140.

It’s not completely clear how the bands in the table above are supposed to change if the Personal Allowance reduces. Should the Basic-rate band stretch as the Personal Allowance band contracts? Should the Basic-rate band slide down and the Higher-rate band stretch? Or should both the Basic-rate and Higher-rate bands slide down and the Additional-rate band kick in earlier?

I believe the second option is the correct one. This is because a separate government web page indicates how the bands are supposed to work when the Personal Allowance is £0, and it shows the Basic rate only applying for the first £37,700 of your income, and the Additional rate still kicking in over £125,140. (Ok, technically “over £125,141”, but I think that’s just a typo.) That means it must be the Higher-rate band stretching.

So, as your income increases within the £100,000–£125,140 region, not only does more of it lie in the 40%-tax band, but also some of your income moves out of the no-tax band into the 40%-tax band. When these two effects are combined, it turns out that the tax paid on income in the £100,000–£125,140 region is actually 60%, which is considerably higher than the 45% tax paid on income above that region!

The graph below shows the effect quite clearly. The x-axis shows a person’s income and the y-axis shows the total tax they pay out of that income. See how the blue line turns upwards at x=£100,000, and then downwards again at x=£125,140.

I think there are a couple of ways that income tax rates could be explained more clearly. One way would be to abolish the Personal Allowance reduction, and instead introduce an extra 60% band into the table. That would give:

BandTaxable incomeTax ratePersonal AllowanceUp to £12,5700%Basic rate£12,571 to £50,27020%Higher rate£50,270 to £100,00040%Even higher rate£100,001 to £125,14060%Additional rateover £125,14045%

A perennial problem with presenting tax rates like this is that it’s easy to forget that each tax rate only applies to the part of the income that lies in the relevant band. For instance, it’s easy to think that if your income is £100,000, you pay 40% of it in tax (when the reality is that you’d only pay about 27%).

One alternative presentation, which tries to address this somewhat, is as follows:

BandIf your income is…Then you keep…Personal AllowanceUp to £12,570All of itBasic rate£12,571 to £50,270The first £12,570, plus 80% of the restHigher rate£50,270 to £100,000The first £31,420, plus 60% of the restEven higher rate£100,001 to £125,140The first £54,280, plus 40% of the restAdditional rateover £125,140The first £30,660, plus 55% of the rest

Then again, we could be a bit more radical, and get rid of the bands altogether. Here’s one possible formula for calculating your tax rate (as a percentage) using the exponential function.

\text{tax rate} = 50 \times ( 1 - \exp(-\text{income} \div 120000))

That is: take the negative of your income (in £), divide it by 120000, calculate the exponential function of the result, subtract that from 1, and finally multiply by 50. Yes it’s a little convoluted, but it does avoid the problems of the band-based system that I explained above. And with the numbers chosen like I have done, the tax paid using the exponential formula (green line in the graph below) is remarkably similar to the tax paid in the current system (blue line):

Taxable incomeTotal tax paid (band system)Total tax paid (my formula)£1,0000£4£10,0000£400£20,000£1,486£1,535£30,000£3,486£3,318£40,000£5,486£5,669£80,000£19,432£19,463£120,000£39,432£37,927£160,000£58,203£58,912£200,000£76,203£81,112

One feature of my formula-based proposal is that there’s no Personal Allowance – income tax is paid on all income, no matter how small. But perhaps it’s good for social cohesion if people can see that the same formula applies to everybody’s income, no matter how high or low it is. Another feature of my proposal is that very high earners do end up paying a bit more – asymptoting to 50% of their income eventually. But perhaps once their income is that high, they won’t notice contributing a bit more of it! 😉


Addendum. A couple of people have mentioned National Insurance. This is an additional tax on income: 0% of any income between £0 and £12,570, plus 8% of any income between £12,570 and £50,270, plus 2% of any income above £50,270. It could be abolished and rolled into income tax. In my formula-based approach, it is a simple matter of changing one of the numbers:

\text{tax rate} = 50 \times ( 1 - \exp(-\text{income} \div 100000))

The graph above would then become:

Screenshot 2024-11-15 at 10.59.48
wicko3
http://johnwickerson.wordpress.com/?p=2536
Extensions
Who checks the equivalence checkers?
AcademiaComputer Science
This post provides a short introduction to a paper that will be presented at DVCon in Munich next month by my PhD student Michalis Pardalos about work we have done with collaborators Alastair Donaldson, Emi Morini and Laura Pozzi. Hardware engineers take great care to ensure that their designs are correct. Unlike bugs in software,… Continue reading Who checks the equivalence checkers? →
Show full content

This post provides a short introduction to a paper that will be presented at DVCon in Munich next month by my PhD student Michalis Pardalos about work we have done with collaborators Alastair Donaldson, Emi Morini and Laura Pozzi.

Hardware engineers take great care to ensure that their designs are correct. Unlike bugs in software, which can often be easily fixed with a quick patch, mistakes in hardware can be exceedingly difficult and costly to address. As such, it is vitally important that hardware synthesis tools, which take human-readable hardware designs and transform them into lower-level representations that are readily implemented in silicon, are correct.

In fact, hardware engineers routinely “double check” the working of these synthesis tools by feeding the input they were given and the output they produced to an “equivalence checker“. An equivalence checker is a tool that takes two hardware designs, perhaps one written in a high-level language like SystemC and one written in a lower-level language like Verilog, and checks whether the two designs really are equivalent; i.e., that they behave in the same way in all possible situations.

The results from these equivalence checkers are deeply trusted, even to the point of authorising a design to be “signed-off” for tape-out and fabrication. Indeed, the Synopsys marketing blog claims that their HECTOR equivalence checker:

delivers 100% confidence that the RTL design implementation conforms to the C/C++ reference algorithm, thereby significantly speeding up signoff for datapath components as compared to simulation-based techniques.

while Siemens claim that their SLEC equivalence checker:

enables designers to have the ultimate confidence to move to high-level synthesis […] dramatically reducing or eliminating the need for design teams to perform simulation/verification of RTL.

My PhD student Michalis Pardalos has been examining how trustworthy these equivalence checkers really are.

He has built a fuzzing tool called Equifuzz, which generates random programs in the SystemC language. (SystemC extends C++ with variable-width datatypes and a few other things, and it is widely used to specify hardware.) There are a few nifty features that make Equifuzz well suited for the task of testing equivalence checkers:

First, it generates SystemC programs in an iterative fashion, maintaining well-formedness at each step. For instance, here is one possible path it could take:

  1. Start with random seed value of -1:
    int dut() {
      return -1;
    }
  2. Add a cast to a 10-bit fixed-point number with an 8-bit integer part (and hence a 2-bit fractional part):
    sc_dt::sc_fixed<10,8> dut() {
      sc_dt::sc_fixed<10,8> x0 = sc_dt::sc_fixed<10,8>(-1);
      return x0;
    }
  3. Add a cast to a 8-bit unsigned integer:
    sc_dt::sc_uint<8> dut() {
      sc_dt::sc_fixed<10,8> x0 = sc_dt::sc_fixed<10,8>(-1);
      sc_dt::sc_uint<8> x1 = sc_dt::sc_uint<8>(x0);
      return x1;
    }

Generating programs in an iterative fashion like this means that test-case reduction becomes rapid – a simple matter of deleting as many of the transformations as possible while preserving the bug. Rapidly converging test-case reduction is always nice, but is especially vital when testing equivalence checkers because they can be quite slow to produce a result, often taking a minute or so.

Second, it generates input-free SystemC programs that calculate a single numerical result. Prior fuzzers like Csmith and YARPGen do the same, but what’s useful about this when testing equivalence checkers is that it is often possible to rephrase “false negative” bugs as “false positives”. To see this, consider the last SystemC program listed above. Compiling it with GCC and then executing the result tells us that it should produce the output 255. Then suppose we give that program to an equivalence checker, together with a Verilog design that produces the correct output:

module top (output [7:0] out);
 
 assign out = 8'd255;
endmodule


If the equivalence checker wrongly reports “not equivalent”, then we have found a false negative bug. This is valuable, but a false negative is a “failsafe” kind of bug: it will simply prevent the hardware engineer from proceeding. However, the equivalence checker usually produces a counterexample in its output log, and by inspecting this we may find that it thinks our SystemC program actually produces a different output, say 252. It stands to reason, then, that if we ask the equivalence checker if our SystemC program is equivalent to this Verilog design:

module top (output [7:0] out);
 
 assign out = 8'd252;
endmodule


then we should observe a false positive bug. A false positive is a more critical kind of bug because it could lead a hardware engineer to think that it is ok to proceed when it is not.

Third, Equifuzz doesn’t avoid undefined behaviour. Most compiler fuzzers like Csmith or YARPGen go to considerable effort to avoid generating programs that contain undefined behaviour, because such programs are useless for testing compilers – a compiler can do anything it likes with them! The situation is different for equivalence checkers, though. A program that contains undefined behaviour must never be deemed “equivalent” – even to itself!

In all the testing Michalis has done to date, Equifuzz has found seven false-positive bugs in three major commercial ECs (where the checker claims equivalence incorrectly). Equifuzz has additionally found five “incompleteness” bugs (where the checker failed to prove equivalence between equivalent designs).

One of the false-positive bugs is the example given above, where the correct output is 255 but one of the equivalence checkers we tested reported the output as being 252. The crux in this case is the implementation of the cast from sc_fixed to sc_uint. The SystemC specification says that this cast should discard the fractional part of the fixed point number before interpreting the integer part as an sc_uint, but this particular equivalence checker was instead interpreting the entire bitvector.

As a second example of a bug we found, the following SystemC program (which happens not to use any SystemC-specific features) should produce 0x3FF0000000000000:

double dut() {
 
 int x0 = 1;
 
 return double(bool(x0));
}


but another equivalence checker we tested deemed it equivalent to Verilog that produces 0x0000000000000001 instead. This equivalence checker seems to be conflating the representation of “1” as a 32-bit integer and the representation of “1” as a 32-bit double, which has a very different bitvector.

For more details about Equifuzz and the bugs it found, check out Michalis’s paper, or say hello to him at DVCon in Munich next month!

tingey-injury-law-firm-yCdPU73kGSc-unsplash
wicko3
http://johnwickerson.wordpress.com/?p=2510
Extensions
Automated feature testing of Verilog parsers using fuzzing
Computer ScienceProgramming Languages
I’m delighted that Quentin Corradi, a PhD student I jointly supervise with George Constantinides, will be presenting his work to improve the reliability of hardware design tools next week at the FUZZING’24 workshop, a satellite event of the ISSTA conference. The Verilog language is widely used in hardware design, and is accepted by a multitude… Continue reading Automated feature testing of Verilog parsers using fuzzing →
Show full content

I’m delighted that Quentin Corradi, a PhD student I jointly supervise with George Constantinides, will be presenting his work to improve the reliability of hardware design tools next week at the FUZZING’24 workshop, a satellite event of the ISSTA conference.

The Verilog language is widely used in hardware design, and is accepted by a multitude of tools, including synthesisers, simulators, and equivalence checkers. Some are open-source, such as ANTLR4, hdlConvertor, Icarus Verilog, Moore, Surelog, Slang, sv-parser, sv2v, Tree-sitter-verilog, Verible, Verilator, Verismith, and Yosys, while others are proprietary, such as Cadence Conformal, Lattice Diamond, Siemens FormalPro, Cadence Genus, Cadence Jasper, Siemens Leonardo Spectrum, Siemens Oasys-RTL, Intel Quartus, Siemens Questasim, AMD Vivado, and Cadence Xcelium Parallel Logic Simulation.

However, most of these tools do not accept the entire Verilog language. And not only do they not accept the entire Verilog language: they often do not make it clear exactly which features they do or don’t accept.

This situation can make it difficult to transfer a Verilog design from one tool to another. And, according to Dave Rich, the technical chair of the IEEE SystemVerilog Working Group committee, it even leads some people to stick with Verilog-1995 syntax, abandoning all the language features that have been standardised since then, simply because feature support is so inconsistent between different tools and different vendors.

Some effort has already been made to document the feature support of Verilog-consuming tools. The SV-Tests project, for instance, does just this for a range of open-source Verilog-consuming tools. However, support for each language feature is assessed only by running a small set of handwritten tests, so it is hard to be confident that a feature really is fully supported, even when it is reported as such. For instance, SV-Tests reports that a Verilog parser called “Surelog” has full support for a Verilog language feature called “escaped identifiers”. However, our pilot experiments quickly identified that the following (valid) Verilog design:

module \TEST ;
endmodule

is (incorrectly) rejected by Surelog. (The root cause is actually quite amusing. Surelog uses the following regular expression for escaped identifiers:

ESCAPED_IDENTIFIER: '\\' [WS\r\t\n]*? WS;
WS: [ ]+;

The intention here is that an escaped identifier begins with a backslash (\), then has a sequence of characters that are not spaces (WS) or line breaks (r or n) or tabs (t), and finally ends with a space (WS). The problem is that when WS appears inside the square brackets, it is interpreted not as an abbreviation for [ ]+, but as the individual characters W and S! As such, Surelog does not support escaped identifiers that contain an uppercase W or an uppercase S, such as \TEST in the example above.)

As such, we are working to do more thorough testing of Verilog feature support with the help of grammar-based fuzzing. Starting from a comprehensive grammar of the Verilog language that we have captured from the official standard, we generate lots of syntactically valid but otherwise random Verilog designs, and see which ones are accepted by which Verilog-consuming tools.

In parallel with this “acceptance testing” (i.e. testing which valid inputs are correctly accepted), we are also doing some “rejection testing” (i.e. testing which invalid inputs are correctly rejected). Rejection testing may seem less critical; after all, surely it is a good thing if a tool does something useful with input that lies outside of the standardised Verilog language? But this is not always the case: it could lead to a hardware engineer wrongly thinking their design is standard-compliant and running into problems if they later feed it to a different Verilog-consuming tool. Here’s one rather trivial example that our pilot experiments identified:

module top #(parameter x = 0,);
endmodule

This is technically not valid Verilog, thanks to the trailing comma in the parameter list. But it is accepted by Yosys.

I have talked above about “pilot experiments” because the paper Quentin is presenting at the FUZZING workshop next week is not the end of the story. Rather, it is a registered report – commonplace in the traditional sciences but fairly radical in computing research. His paper details the fuzz-testing experiments that he plans to run, which means that the workshop committee has accepted it based on the strengths of his hypotheses and his proposed method, rather than on how many bugs he happened to stumble upon. Stay tuned to find out how many bugs his testing campaign on the tools listed at the top of this post actually reveal!

Screenshot 2024-09-09 at 14.19.11
wicko3
http://johnwickerson.wordpress.com/?p=2497
Extensions
In praise of Freddy, My Love
Music
I was singing Freddy, My Love as a lullaby for my toddler last night, and enjoying the nifty rhyming triplets (catches–patches–matches, and so on), when I suddenly realised that the song has almost twice as many rhymes as I had previously thought! The last three lines of each verse not only rhyme at their ends,… Continue reading In praise of Freddy, My Love →
Show full content

I was singing Freddy, My Love as a lullaby for my toddler last night, and enjoying the nifty rhyming triplets (catches–patches–matches, and so on), when I suddenly realised that the song has almost twice as many rhymes as I had previously thought! The last three lines of each verse not only rhyme at their ends, but they rhyme half way through as well! It made me give an audible gasp (not ideal while lullabying) so I thought it worth sharing.

Here are the lyrics. The rhymes are in bold; the ones in blue are new to me. They mean that each verse ends with an AB–AB–AB pattern.

Freddy, my love,
I miss you more than words can say.
Freddy, my love,
Please keep in touch while you’re away.
Hearing from you can make the day so much better,
Getting a souvenir or maybe a letter,
I really flipped over the grey cashmere sweater!

Freddy, you know,
Your absence makes me feel so blue.
That’s okay though,
Your presents make me think of you.
My ma will have a heart attack when she catches
Those pedal-pushers with the black leather patches.
Oh, how I wish I had a jacket that matches!

Don’t keep your letters from me,
I thrill to every line.
Your spelling’s kinda crummy,
But honey, so is mine.
I treasure every gifty.
The ring is really nifty.
You say it cost you fifty,
So you’re thrifty
I don’t mind!

Freddy, you’ll see,
You’ll hold me in your arms someday,
And I will be
Wearing your lacy lingerie.
Thinking about it, my heart’s pounding already.
Knowing when you come home, we’re bound to go steady
And throw your service pay around like confetti!

Of course, tinkerers gotta tinker, so here’s my attempt to force even more rhymes into each verse (my changes are in red). To be clear, I think Jim Jacobs and Warren Casey’s original lyrics work better overall, but it’s rather fun to see an ABC–ABC–ABC rhyming pattern.

Freddy, my love,
I miss you more than words can say.
Freddy, my love,
Please keep in touch while you’re away.
Hearing from you can make the day so much better,
Getting a souvenir or maybe a letter,
I really swooned over the grey cashmere sweater!

Freddy, you know,
Your absence makes me feel so blue.
That’s okay though,
Your presents make me think of you.
My ma will go into a rage when she catches
Those penny loafers with the aged leather patches.
Oh, how I hope you’ll send the beige skirt that matches!

Don’t keep your letters from me,
I thrill to every line.
Your spelling’s kinda crummy,
But honey, so is mine.
I treasure every gifty.
The ring is really nifty.
You say it cost you fifty,
So you’re thrifty
I don’t mind!

Freddy, you’ll see,
You’ll hold me in your arms someday,
And I will be
Wearing your lacy lingerie.
Just thinking through it, my heart’s pounding already.
Knowing when you come home, we’re bound to go steady
And throw your duty pay around like confetti!

marty
wicko3
http://johnwickerson.wordpress.com/?p=2481
Extensions
Mix-testing: revealing a new class of compiler bugs
Programming LanguagesComputer ScienceComputer Architecturearmatomicsclangcompilersconcurrencylinkingx86
I’m delighted that Luke Geeson’s work on “mix testing” (a collaboration with James Brotherston, Wilco Dijkstra, Alastair Donaldson, Lee Smith, Tyler Sorensen, and myself) will appear at OOPSLA 2024 in October.  There has been quite a lot of work on “litmus testing” of compilers in the last couple of decades. This mainly takes the form of… Continue reading Mix-testing: revealing a new class of compiler bugs →
Show full content

I’m delighted that Luke Geeson’s work on “mix testing” (a collaboration with James Brotherston, Wilco Dijkstra, Alastair Donaldson, Lee Smith, Tyler Sorensen, and myself) will appear at OOPSLA 2024 in October. 

There has been quite a lot of work on “litmus testing” of compilers in the last couple of decades. This mainly takes the form of (1) generating lots of little concurrent programs, (2) feeding them into a compiler, and then (3) seeing if the compiled code does the right thing when executed on the target machine. Luke’s idea is: what if we don’t give the whole litmus test to a single compiler, but instead we break it up into little fragments, and have each fragment compiled by a different compiler. Thus, we have the potential to find bugs that arise out of unexpected interactions between compilers.

Here’s a simple example of how a “mixing bug” might arise when targeting the x86 architecture.

Consider the following “store buffering” litmus test, written in C:

Two threads write and read to two shared locations, x and y. The C standard requires that after executing these two threads, the values in t and u must not both be zero (in other words: at least one of the threads must see the other thread’s write).

However, if this program is compiled naïvely to x86, that requirement will not be met, because when x86 processors encounter a write instruction followed by a read of a different location, they do not promise to keep them in order. Therefore, the compiler must insert a barrier between them, to prevent the potential reordering.

There are two plausible ways to accomplish this. One is for the compiler to insert a barrier after each write. That would yield something like this:

The other option is to insert a barrier before each read. That would yield something like this:

The first scheme is more popular because it tends to result in fewer barriers, given that writes are generally less common than reads. It is the scheme used by the major compilers, GCC and Clang. But either scheme is fine … as long as it is used consistently. Problems arise if the two schemes are mixed. For instance, if the first instructions in both threads are compiled using the barriers-before-reads scheme, and the second instructions in both threads are compiled using the barriers-after-writes scheme, then both compilers will wrongly assume that the other compiler is responsible for keeping the instructions in order, and no barrier will be inserted at all!

That example serves to illustrate what mixing bugs look like, but it’s not really a bug because it is already well understood that the barriers-before-reads and barriers-after-writes compilation schemes must not be mixed.

So, here is a real mixing bug. It’s a little more complicated, but the basic idea is the same: two different compilation schemes interacting in such a way that fails to enforce some necessary ordering.

Consider the same “store buffering” litmus test again, but this time compiled for Arm. Here’s how it is compiled by Clang for the Armv7 architecture:

I’ll skip the gory details of the assembly code. What’s important is how the compiler ensures that the stores and loads don’t get reordered. In this case, the compiler has inserted a barrier (DMB) at the end of all four instruction sequences. In particular, the barriers marked with the red rectangles are the ones that are important here.

Here’s how that same program is compiled by Clang for the Armv8 architecture:

This mapping is rather more efficient because it has avoided all those (expensive) barriers, and instead exploited some features that only became available in Armv8. Specifically, store-release instructions (STL) cannot be reordered with subsequent load-acquire instructions (LDA). In particular, this means that the stores and loads marked with the blue puzzle pieces cannot be reordered. I use a puzzle-piece metaphor here to indicate that these instructions only enforce ordering when they appear together. Neither instruction acts as a barrier on its own.

Now suppose we compile the first instruction in each thread for Armv8, but the second instruction for Armv7. This is perfectly legal, by the way – the resulting code should run fine on an Armv8 machine because Armv8 is fully backward-compatible with Armv7. We get something like this:

We see that neither the Armv7 nor the Armv8 mechanism for inserting a barrier is working. The Armv7 scheme relies on a barrier being inserted after the store, but the Armv8 scheme does not provide this barrier! Instead, the Armv8 scheme uses a store-release, but this is ineffective because the Armv7 scheme does not produce the corresponding load-acquire! In short, nothing is preventing the stores and loads being reordered, and we have a compiler bug.

Luke has built a tool that explores lots of possible compiler combinations for each litmus test, and he has used it to find several bugs like the one above. (That bug remains open at the time of writing, by the way.) Luke is now working with his colleagues at Arm to make their ABI more explicit about how and when various compilation schemes can be safely mixed. Hopefully, this effort will provide firmer foundations for mixed compilation in the future.

This work was supported by the Engineering and Physical Sciences Research Council (grant number EP/V519625/1). The views expressed in this blog post are not endorsed by Arm or any other company mentioned.

ai_image2
wicko3
http://johnwickerson.wordpress.com/?p=2447
Extensions
Cycling from King’s Cross to Imperial College London
Uncategorized
I’ve used Santander Cycles to get from King’s Cross to Imperial College hundreds of times over the last couple of years, and I’d like to think that I’m pretty close to finding the fastest possible route now. The basic route is to aim for New Cavendish Street and then cross Hyde Park, but I’ve found… Continue reading Cycling from King’s Cross to Imperial College London →
Show full content

I’ve used Santander Cycles to get from King’s Cross to Imperial College hundreds of times over the last couple of years, and I’d like to think that I’m pretty close to finding the fastest possible route now.

The basic route is to aim for New Cavendish Street and then cross Hyde Park, but I’ve found several tweaks that can shave off a few seconds, e.g. a quick zig-zag along Leigh Street to avoid the traffic lights at the Judd Street/Tavistock Place intersection.

The whole thing tends to take me about 27 minutes, which is handy because Santander Cycles only starts charging after 30 minutes.

I’ve optimised primarily for speed, but quite a bit of the route is on dedicated cycle lanes, and it’s not completely devoid of interest along the way – one passes by Gordon Square, Tavistock Square Gardens, University College London, the University of Westminster, Baker Street, Hyde Park, and the Royal Albert Hall.

Here’s a picture of the route from Google Maps.

Picture of a cycle route from King's Cross to Imperial College London.

The return from Imperial College to King’s Cross is basically identical, except you have to stay on the A5204 all the way from Edgware Road to Tottenham Court Road because of the one-way system.

image
wicko3
Picture of a cycle route from King's Cross to Imperial College London.
http://johnwickerson.wordpress.com/?p=2428
Extensions
Verified high-level synthesis – now with hyperblocks!
Computer ArchitectureComputer ScienceFPGAs
Yann Herklotz has added hyperblock scheduling to his verified high-level synthesis (HLS) tool called Vericert, and I’m very pleased that our paper about this work has been accepted to appear at PLDI 2024 this June. This paper was our “difficult second album”, in the sense that we’d already published a paper about the first version… Continue reading Verified high-level synthesis – now with hyperblocks! →
Show full content

Yann Herklotz has added hyperblock scheduling to his verified high-level synthesis (HLS) tool called Vericert, and I’m very pleased that our paper about this work has been accepted to appear at PLDI 2024 this June.

This paper was our “difficult second album”, in the sense that we’d already published a paper about the first version of Vericert in OOPSLA 2021, so we no longer had such a crisp “novelty” story. That is, we were no longer presenting “the first mechanically verified HLS tool”, but rather “a verified HLS tool that generates more efficient hardware designs”. I say “difficult” not because we had particular trouble getting the paper accepted – PLDI 2024 was actually the first venue we tried – but because the amount of work Yann did to add hyperblock scheduling to Vericert was about twice as much as the work he did to produce the entire first version of Vericert!

So what’s hyperblock scheduling all about, and why has it been added to an HLS tool?

As a quick primer: HLS is the process of compiling software (written in C, say) not into assembly language for a general-purpose processor, but into a design (written in Verilog, say) for a fixed-function piece of hardware that behaves like the input software. It’s popular these days because offloading computation to custom hardware accelerators is a great way to improve performance and energy efficiency, but writing those accelerators manually is laborious and error-prone. Better to specify the computation in software and then generate the hardware accelerator automatically via HLS!

Unfortunately, today’s crop of HLS tools are known to be a bit buggy, which makes them ill suited for use in a safety- or security-critical setting. This motivated the development of Vericert v1.0 – the first HLS tool mechanically proven (in Coq) never to produce a Verilog design that behaves differently from the C program it is given.

Although Vericert v1.0 had a solid correctness story, its performance was quite underwhelming compared to that of the existing (unverified) HLS tools. Its main shortcoming was that it failed to exploit the inherently parallel nature of hardware. Instead, it produced hardware that performed just one operation per clock cycle. Essentially, it took the input C program, transformed that C program into a control-flow graph (CFG), and turned that CFG into a state machine, with just one instruction per state.

More concretely, if Vericert v1.0 were given a piece of straight-line C code like this:

c = a * b;
e = c + d;
f = a * d;
g = e * b;

it would produce Verilog that looks (roughly) like this:

initial state = 0;

always @ (posedge clk)
  case (state)
    0: begin c = a * b; state = 1; end
    1: begin e = c + d; state = 2; end
    2: begin f = a * d; state = 3; end
    3: begin g = e * b; end
  endcase

and which would execute over four clock cycles (states) like this:

Now, we’d like to pack more computation into each clock cycle. We could do this really naively by packing everything into a single clock cycle. That would produce Verilog that looks (roughly) like this:

initial state = 0;

always @ (posedge clk)
  case (state)
    0: begin c = a * b; 
             e = c + d;
             f = a * d;
             g = e * b; end
  endcase

and which would execute, in a single clock cycle, like this:

Although our computation now requires fewer clock cycles, our hardware is still not good because its critical path (highlighted in pink above) is too long. That is, the clock frequency would have to be really low in order to allow time for a multiplication, then an addition, and then a second multiplication all to complete within a single cycle.

What HLS tools do in practice, then, is (1) fix a desirable target clock frequency, and then (2) try to pack as many operations into each clock cycle as possible, ensuring that no cycle has a critical path that is too long for the target frequency. This is a matter of scheduling, and it might end up producing Verilog that looks (roughly) like this:

initial state = 0;

always @ (posedge clk)
  case (state)
    0: begin c = a * b; 
             e = c + d;
             f = a * d; 
             state = 1; end
    1: begin g = e * b; end
  endcase

and which executes, in two clock cycles, like this:

This design probably offers the best compromise between minimising the number of clock cycles and maximising the clock frequency. And better still, if it can be determined that the value of c is not actually used except to hold the intermediate calculation a*b, then we can use an efficient fused multiply-add operator, like so:

We thus arrive at the first contribution of Vericert v2.0: list scheduling. This is an optimisation that takes each basic block of the C code, and decides how best to chunk it up into clock cycles, taking account of the estimated delay of each operation.

However, existing HLS tools go further than this.

The thing is, list scheduling only deals with straight-line code, like in our example. We’d prefer also to optimise code that has more interesting control flow, such as if-statements, like this:

f = a * d;
if (d > 0)
  c = a * b;
else
  g = b * c;
e = c + d;

List scheduling would approach that code as four separate basic blocks, each requiring a separate state in the Verilog state machine. However, if we perform a process called if-conversion then we end up with what is known as a hyperblock. A hyperblock is a basic block where each instruction is guarded by a predicate, and an instruction executes only if its predicate evaluates to true. If-conversion on the code above might yield the following hyperblock:

true    => f = a * d;
d > 0   => c = a * b;
d <= 0  => g = b * c;
true    => e = c + d;

We’re back to straight-line code, and can hence can apply our scheduling approach once again to see how many operations we can safely pack into each clock cycle. But our scheduler must now be a hyperblock scheduler, which means that it must deal with predicates correctly. For instance, it must take account of the fact that the predicates d > 0 and d <= 0 are mutually exclusive, so the assignments to c and g can’t both occur, and can hence be scheduled for the same clock cycle (and even for the same multiplication unit).

So that’s the essence of Vericert v2.0 – Vericert v1.0 with a hyperblock scheduler, and everything still mechanically proven correct in Coq. Vericert v2.0 is still not truly competitive with state-of-the-art unverified HLS tools, as there are still many more optimisations that it doesn’t perform, but our performance evaluation indicates that when a state-of-the-art open-source HLS tool called Bambu is restricted to use only the optimisations that Vericert v2.0 knows about, the two tools perform quite similarly. And what’s nice about Yann’s proof is that it is completely independent of how the scheduler is implemented – he confirms that each schedule produced is correct, using a Coq-verified checker, but carefully ignores how the scheduler works. This means that the scheduler can be tweaked over time without needing to revisit the correctness proof.

screenshot-2024-03-01-at-13.58.58
wicko3
http://johnwickerson.wordpress.com/?p=2409
Extensions
What is the other Incorrectness logic?
Computer ScienceProgramming Languages
The Hoare triple has a very simple meaning, namely: . That is: if the precondition is satisfied in some state , and can transform into , then must satisfy the postcondition . (We’ll allow non-deterministic commands, so need not be uniquely determined by and .) Or, more concisely: if holds before executing , then will… Continue reading What is the other Incorrectness logic? →
Show full content

The Hoare triple \{P\}\,c\,\{Q\} has a very simple meaning, namely:

\forall\sigma, \sigma'\ldotp \sigma \in P \wedge (\sigma,\sigma') \in c \longrightarrow \sigma'\in Q.

That is: if the precondition P is satisfied in some state \sigma, and c can transform \sigma into \sigma', then \sigma' must satisfy the postcondition Q. (We’ll allow non-deterministic commands, so \sigma' need not be uniquely determined by c and \sigma.) Or, more concisely: if P holds before executing c, then Q will hold afterwards.

The statement above can be reformulated in a couple of different ways, all equivalent. One way is to introduce Dijkstra’s concept of a weakest precondition, like so:

pre(c,Q) = \{\sigma\mid \forall \sigma'\ldotp (\sigma,\sigma') \in c \longrightarrow \sigma'\in Q\}

and then observe that \{P\}\,c\,\{Q\} is the same as P \subseteq pre(c,Q). Another way is to introduce the concept of a strongest postcondition, also due to Dijkstra, like so:

post(P,c) = \{\sigma'\mid \exists \sigma\ldotp \sigma\in P \wedge (\sigma,\sigma') \in c\}

and then observe that \{P\}\,c\,\{Q\} is the same as post(P,c) \subseteq Q.

In 2019, O’Hearn introduced Incorrectness logic. This logic involves triples [P]\,c\,[Q] that have a subtly different meaning to Hoare’s, namely:

\forall\sigma'\ldotp \sigma' \in Q \longrightarrow \exists \sigma\ldotp \sigma \in P \wedge (\sigma,\sigma') \in c.

That is: if the postcondition Q is satisfied by some state \sigma', then there must exist some state \sigma that can be transformed into \sigma' by command c and that satisfies the precondition P. Or, more concisely: every state where Q holds is reachable by executing c from a state where P holds.

There’s a neat duality between Incorrectness logic and Hoare logic:

\{P\}\,c\,\{Q\} is equivalent to post(P,c) \subseteq Q

[P]\,c\,[Q] is equivalent to post(P,c) \supseteq Q

What’s curious is that this \subseteq/\supseteq trick can only be used with the strongest-postcondition formulation. If you try it on the weakest-precondition formulation, you don’t get Incorrectness logic. In other words, P \subseteq pre(c,Q) and post(P,c) \subseteq Q are logically equivalent, but P \supseteq pre(c,Q) is not the same as post(P,c) \supseteq Q. If post(P,c) \supseteq Q corresponds to Incorrectness logic, then what does P \supseteq pre(c,Q) correspond to?

So let’s see what happens if we introduce a third type of triple, \langle P\rangle\,c\,\langle Q\rangle, which shall mean:

\forall \sigma\ldotp (\forall \sigma' (\sigma,\sigma') \in c \longrightarrow \sigma' \in Q) \longrightarrow \sigma \in P.

That is: if \sigma is a state that is always transformed by command c into one that satisfies the postcondition Q, then \sigma must satisfy the precondition P. Or, more concisely: if P doesn’t hold before executing c, then Q will not hold afterwards.

Essentially, we are filling in the gap marked “???” in the picture below (where the thick arrows indicate equivalences):

What can we say about these new \langle P \rangle c \langle Q \rangle triples? What properties do they enjoy?

In some ways, all three forms of triple are alike. For instance, they all admit the same Sequencing rule:

\displaystyle\frac{ \{ P \} c_1 \{ Q \} \quad \{ Q \} c_2 \{ R \} } { \{ P \} c_1 ; c_2 \{ R \} }

\displaystyle\frac{ [ P ] c_1 [ Q ] \quad [ Q ] c_2 [ R ] } { [ P ] c_1 ; c_2 [ R ] }

\displaystyle\frac{ \langle P \rangle c_1 \langle Q \rangle \quad \langle Q \rangle c_2 \langle R \rangle } { \langle P \rangle c_1 ; c_2 \langle R \rangle }

and they all admit the Iteration rule:

\displaystyle\frac{ \{ P \} c \{ P \} } { \{ P \} c^* \{ P \} }  \displaystyle\frac{ [ P ] c [ P ] } { [ P ] c^* [ P ] }  \displaystyle\frac{ \langle P \rangle c \langle P \rangle } { \langle P \rangle c^* \langle P \rangle }

In some ways, the new triples behave more like Incorrectness triples than like Hoare triples. This is the case for the Skip rules:

\displaystyle\frac{ P \Rightarrow Q } { \{ P \} \text{skip} \{ Q \} }  \displaystyle\frac{ Q \Rightarrow P } { [ P ] \text{skip} [ Q ] }  \displaystyle\frac{ Q \Rightarrow P } { \langle P \rangle \text{skip} \langle Q \rangle }

and for the Consequence rules:

\displaystyle\frac{ P' \Rightarrow P \quad \{ P \} c \{ Q \} \quad Q \Rightarrow Q' } { \{ P' \} c \{ Q' \} }

\displaystyle\frac{ P \Rightarrow P' \quad [ P ] c [ Q ] \quad Q' \Rightarrow Q } { [ P' ] c [ Q' ] }

\displaystyle\frac{ P \Rightarrow P' \quad \langle P \rangle c \langle Q \rangle \quad Q' \Rightarrow Q } { \langle P' \rangle c \langle Q' \rangle }

In yet other ways, all three forms of triple are different. For instance, they all say different things about the command \varnothing that always diverges (fails to terminate):

  • \{ P \} \varnothing \{ Q \} always holds, because Hoare logic gives no guarantees about non-terminating executions.
  • [ P ] \varnothing [ Q ] holds iff Q = False, because Incorrectness logic demands termination.
  • \langle P \rangle \varnothing \langle Q \rangle holds iff P = True, because \varnothing will always fail to reach Q.

Another way to distinguish the three forms is to ask what they say when the postcondition is unsatisfiable. (In what follows, let diverges(c) be the set of states from which execution of c diverges.)

  • \{ P \} c \{ False \} holds iff P \Rightarrow diverges(c); i.e. P is a sufficient condition for c diverging.
  • [ P ] c [ False ] always holds.
  • \langle P \rangle c \langle False \rangle holds iff diverges(c) \Rightarrow P; i.e. P is a necessary condition for c diverging.

Here are some rules about c_1 \cup c_2, which denotes a non-deterministic choice between commands c_1 and c_2:

\displaystyle\frac{ \{ P_1 \} c_1 \{ Q_1 \} \quad \{ P_2 \} c_2 \{ Q_2 \} } { \{ P_1 \wedge P_2 \} c_1 \cup c_2 \{ Q_1 \vee Q_2 \} }

\displaystyle\frac{ [ P_1 ] c_1 [ Q_1 ] \quad [ P_2 ] c_2 [ Q_2 ] } { [ P_1 \vee P_2 ] c_1 \cup c_2 [ Q_1 \vee Q_2 ] }

\displaystyle\frac{ \langle P_1 \rangle c_1 \langle Q_1 \rangle \quad \langle P_2 \rangle c_2 \langle Q_2 \rangle } { \langle P_1 \wedge P_2 \rangle c_1 \cup c_2 \langle Q_1 \wedge Q_2 \rangle }

Here are some rules for conjoining or disjoining two different specifications for the same command:

\displaystyle\frac{ \{ P_1 \} c \{ Q_1 \} \quad \{ P_2 \} c \{ Q_2 \} } { \{ P_1 \vee P_2 \} c \{ Q_1 \vee Q_2 \} }

\displaystyle\frac{ \{ P_1 \} c \{ Q_1 \} \quad \{ P_2 \} c \{ Q_2 \} } { \{ P_1 \wedge P_2 \} c \{ Q_1 \wedge Q_2 \} }

\displaystyle\frac{ [ P_1 ] c [ Q_1 ] \quad [ P_2 ] c [ Q_2 ] } { [ P_1 \vee P_2 ] c [ Q_1 \wedge Q_2 ] }

\displaystyle\frac{ \langle P_1 \rangle c \langle Q_1 \rangle \quad \langle P_2 \rangle c \langle Q_2 \rangle } { \langle P_1 \wedge P_2 \rangle c \langle Q_1 \wedge Q_2 \rangle }

Here are some ways to translate between Hoare triples and Incorrectness triples. (In what follows, let determ(c) be the set of states from which execution of c is deterministic, and let c^{-1} denote the inverse of the command c.)

  • If P \cap diverges(c) = \emptyset then \{ P \} c \{ Q \} implies [Q] c^{-1} [P].
  • If P \subseteq determ(c) then [Q] c^{-1} [P] implies \{ P \} c \{ Q \}.
  • Hence, if P \cap diverges(c) = \emptyset and P \subseteq determ(c) then \{ P \} c \{ Q \} coincides with [Q] c^{-1} [P].

Here are some ways to translate between Hoare triples and the new form of triples.

  • If P \cap diverges(c) = \emptyset then \{ P \} c \{ Q \} implies \langle \neg P \rangle c \langle \neg Q \rangle.
  • If P \subseteq determ(c) then \langle \neg P \rangle c \langle \neg Q \rangle implies \{ P \} c \{ Q \}.
  • Hence, if P \cap diverges(c) = \emptyset and P \subseteq determ(c) then \{ P \} c \{ Q \} coincides with \langle \neg P \rangle c \langle \neg Q \rangle.

If we become a bit more concrete now, and model program states simply as functions from variables to integers, then we can obtain specifications of the “increment x” command like so:

  • \{ x = n \}\,{+}{+}x \,\{x = n + 1\}
  • [ x = n ] \, {+}{+}x \, [ x = n + 1 ]
  • \langle x = n \rangle \, {+}{+}x \, \langle x = n + 1 \rangle

or:

  • \{ x > n \} \, {+}{+}x \, \{x > n + 1\}
  • [ x > n ] \, {+}{+}x \, [ x > n + 1 ]
  • \langle x > n \rangle \, {+}{+}x \, \langle x > n + 1 \rangle

For specifying an “assign a constant” command, we can use:

  • \{ x > 2 \} \, x := 5 \, \{ x = 5 \}
  • [ x > 2 ] \, x := 5 \, [ x = 5 ]

but \langle x > 2 \rangle \, x := 5 \, \langle x = 5 \rangle is invalid because x>2 is not necessary to guarantee x = 5 after the assignment. In fact, any initial state will suffice here, which means \langle True \rangle \, x := 5 \, \langle x = 5 \rangle is valid.

We can also specify that assignment command more loosely:

  • \{ x > 2 \} \, x := 5 \, \{ x > 3 \}

but [ x > 2 ] \, x := 5 \, [ x > 3 ] is invalid because executing x := 5 can never reach a final state where x = 7, say. And \langle x > 2 \rangle \, x := 5 \, \langle x > 3 \rangle is invalid too, for the same reason as given above.

Finally, we can specify some non-deterministic commands, such as a command that assigns to x either 4 or 5:

  • \{ True \} \, (x := 4) \cup (x := 5) \, \{ x = 4 \vee x = 5 \}
  • [ True ] \, (x := 4) \cup (x := 5) \, [ x = 4 \vee x = 5 ]
  • \langle True \rangle \, (x := 4) \cup (x := 5) \, \langle x = 4 \vee x = 5 \rangle

or a command that either increments x or does nothing:

  • \{ x > n \} \, ({+}{+}x) \cup \text{skip} \, \{ x > n \}
  • [ x > n ] \, ({+}{+}x) \cup \text{skip} \, [ x > n ]
  • \langle x > n \rangle \, ({+}{+}x) \cup \text{skip} \, \langle x > n \rangle

In terms of related work:

  • O’Hearn does mention the curious inequivalence between P \supseteq pre(c,Q) and post(P,c) \supseteq Q in his POPL’19 paper (near the end of Section 7), but doesn’t dig into the implications of building a proof system around triples that mean P \supseteq pre(c,Q).
  • He does, however, point out some earlier work by Cousot et al. that made use of these triples in an automated verification context, which was published in VMCAI 2013. Cousot et al. referred to them as necessary preconditions, and they used them because they wanted to avoid false positives in their automatic tool. However, they didn’t develop a set of proof rules for reasoning about these triples.
  • More recently, Zilberstein, Dreyer, and Silva’s OOPSLA 2021 paper proposes Outcome logic as a way to unify \{P\}\,c\,\{Q\} triples and [P]\,c\,[Q] triples into a single logic based on monads and monoids. I haven’t yet figured out whether Outcome logic also covers \langle P\rangle \,c\,\langle Q\rangle triples.

To conclude: \langle P\rangle \,c\,\langle Q\rangle gives rise to a proof system that seems to have some interesting properties, and perhaps it shines some light on the nature of Hoare logic and Incorrectness logic by where it differs from them. I’m not convinced that it’s a useful proof system like Hoare logic or Incorrectness logic certainly are. Still, it’s always satisfying to fill in “gaps”!

An Isabelle-2023 proof script that formalises all of the results given above is available.

screenshot-2024-02-15-at-10.42.36
wicko3
http://johnwickerson.wordpress.com/?p=2305
Extensions
CXL: What’s all the fuss about?
Computer ArchitectureComputer Science
The world of computer architecture is quite excited these days about something called Compute Express Link (CXL). It’s a new standard that allows the various components of a datacentre computer to communicate large amounts of data efficiently with each other. In this article, I’ll explain what CXL is, and why there is so much excitement… Continue reading CXL: What’s all the fuss about? →
Show full content

The world of computer architecture is quite excited these days about something called Compute Express Link (CXL). It’s a new standard that allows the various components of a datacentre computer to communicate large amounts of data efficiently with each other.

In this article, I’ll explain what CXL is, and why there is so much excitement about the possibilities that it opens.

I’ll start by talking about the situation before CXL came along.

The Way Things Are

We are in the age of parallelisation and specialisation. That is, it’s no longer practical to push all our computation through a single processor that runs very quickly; instead, we have to spread our computation out across several processors, all running in parallel. These various processors need not be identical; indeed, there are advantages to having different kinds of processors, all specialised for different tasks, and arranging that the overall computation is split up so as to play to the strengths of the available processors. For instance, a CPU might be well suited for handling user interaction, while the task of iterating through a large array might be handed off to a GPU.

Of course, now that we have spread out our processors in this way, we need to ensure that the data that our processors need is in the right place at the right time. And since we are also in the age of Big Data, we have a lot of data to move around.

It’s not practical to keep all data in a single memory — the delay associated with each read or write would be intolerable. CPUs tackle this problem by using caches: locations that are likely to be needed soon are copied into small ‘cache’ memories that are fast to access. Cache-coherence protocols are employed to ensure that data in those copies remains in-sync with the contents of the main memory.

Cache-coherence works nicely within a single CPU, but what about our heterogeneous system made up of CPUs, GPUs, and other devices?

Let’s consider a simple heterogeneous system made up of a CPU and a GPU, as shown in the picture below.

There are a couple of ways for the GPU to access data in the CPU’s memory. The traditional way is to copy data en-masse from the CPU’s memory into the GPU’s memory, and then to copy it back once the computation is complete. But this can involve an intolerable amount of data movement, much of which may be unnecessary.

An alternative is to give the GPU direct access to the locations in the CPU’s memory that it actually needs to read or write. This can reduce the amount of data flowing between the processors, but each individual memory access by the GPU now becomes very slow, because it has to go all the way to the CPU’s memory to get the freshest data. It cannot rely on copies of CPU memory locations that it may have stored in its own caches, because they may contain out-of-date values. After all, as shown in the picture above, the GPU is not part of the CPU’s cache-coherence domain, so it is not notified if the CPU overwrites a memory location that the GPU has cached.

The usual workaround is to ensure that the CPU does not access any of the memory locations to which the GPU has access until the GPU has finished with them. Naturally, this restricts what the CPU is able to do while the GPU is performing its task.

Enter CXL

The key idea of CXL is to extend the cache-coherence domain so that it covers not just the CPU, but all of the connected processors too. This gives us the following picture.

Why is this a good thing? Well, the CPU and the GPU can now cache any location in the CPU’s memory, safe in the knowledge that the cache-coherence protocol will notify them if the cached data goes out-of-date. There is no longer any need for the CPU to avoid memory locations that the GPU has been given access to.

So why wasn’t this done before? Well, it kinda was. For instance, AMD developed Infinity Fabric, which is used to provide a cache-coherent interconnect between AMD CPUs and AMD GPUs on Frontier, the fastest supercomputer in the world since 2020. Before that, NVIDIA developed NVLink, which did the same for IBM POWER9 CPUs and NVIDIA GPUs in Summit, the fastest supercomputer in the world before 2020. OpenCAPI is another cache-coherent interconnect protocol for IBM CPUs. The problem is that all of these protocols are vendor-specific. It is not possible to connect, for instance, an AMD CPU and an NVIDIA GPU, because the cache-coherence mechanisms used by one device are not meaningful to the other device.

One attempt that came close to being a vendor-independent standard was CCIX (Cache-Coherent Interconnect for Accelerators), which arose in the mid-2010s. It had the backing of several big players in the industry (AMD, Xilinx, Huawei, and Arm) but lacked broader support, most notably from Intel, and is now considered dead. CXL was proposed by Intel in 2019, and quickly attained broad support.

A Killer Application: Memory Pooling

As I’ve described above, CXL promises to make it easier for CPUs, GPUs, and other devices to communicate large amounts of data efficiently with each other.

But the CXL application that has the most industry buzz around it is called memory pooling.

Given that CXL can be used to allow devices to quickly access each other’s memory, the question arises: can we use CXL to connect devices that solely contain memory?

The picture above is enticing to a datacentre operator, because it shows how memory and processors can be disentangled. In our previous setup, each processor had a fixed amount of memory attached to it. This fixed amount had to be rather large, in order to suffice for the most memory-intensive workloads that the processor handles. But large amounts of memory are expensive, and anyway, most of that memory is likely unused most of the time.

But once we have CXL-connected memory, then this memory can be used by any processor that is also CXL-connected. It can be allocated to one processor for a while, then returned to the ‘pool’, and then allocated to a different processor that needs it. This can result in the system requiring far less memory attached to each processor, and hence far less memory overall, and hence substantial cost savings for the datacentre operator.

Sources and Further Reading
screenshot-2023-08-24-at-14.52.33
wicko3
http://johnwickerson.wordpress.com/?p=2283
Extensions
The fastest route between King’s Cross St Pancras and South Kensington
Uncategorized
I’ve commuted via King’s Cross St Pancras to Imperial College’s South Kensington Campus for about a decade now, but have only recently discovered what I believe to be the fastest possible route between those two London stations. King’s Cross St Pancras to South Kensington takes me about 20 minutes, and the return journey takes about… Continue reading The fastest route between King’s Cross St Pancras and South Kensington →
Show full content

I’ve commuted via King’s Cross St Pancras to Imperial College’s South Kensington Campus for about a decade now, but have only recently discovered what I believe to be the fastest possible route between those two London stations. King’s Cross St Pancras to South Kensington takes me about 20 minutes, and the return journey takes about 17.

From King’s Cross St Pancras to South Kensington

I’ll assume you’re standing outside King’s Cross railway station. Enter the Underground station via the entrance shown in the photo below.

Starting point outside King’s Cross railway station.

There are two things you need to know about the King’s Cross St Pancras Underground station: one is that they built a new ticket hall in 2009 (the ‘northern’ ticket hall), and the other is that you should never use it. Wherever you’re coming from and wherever you’re going, if you find yourself in the northern ticket hall then you’ve gone the long way round.

So although you’re aiming for the Victoria line, you shouldn’t actually follow the signs that tell you to go right for the Victoria line. They’ll take you via the northern ticket hall. Instead, go left.

The sign says go right for the Victoria line, but you should go left.

This will take you to another ticket hall (the ‘Tube’ ticket hall). From this point, you can safely follow the signs for the Victoria line (southbound). Once you step onto the platform, don’t be tempted to move; you’ll want to get on the train at its frontmost door.

Stand here on the southbound Victoria line platform.

Get off the train at Victoria. You are now aiming for the District and Circle lines. In front of you is a sign inviting you to walk back along the platform if you want the District and Circle lines. It is a lie. Ignore it, and follow the signs for National Rail instead.

This sign is a lie.

You will soon see another sign insisting that you are on a route with no access to the District and Circle lines. It is also a lie. Ignore it, and proceed up the escalator in front of you.

This sign is also a lie.

At the top of the escalator, turn left, then left again to go down another escalator. You can now follow signs for the District and Circle lines (westbound).

Follow the signs for the District and Circle lines (westbound).

You’ll want to board the train at the front of the third car. This requires you to walk left along the platform until the ground under your feet looks like the photo below.

Stand here on the westbound District and Circle lines platform.

When you get off the train at South Kensington you should find the stairs directly in front of you. (Unfortunately this station has no step-free exit.)

From South Kensington to King’s Cross St Pancras

Starting at the ticket hall of the South Kensington Underground station, take the stairs down to the District and Circle lines platform. You’re heading eastbound, so keep left at the bottom of the stairs. Keep walking straight ahead, until you’re most of the way along the platform. You’ll want to board the train at the back of the second car, which means you should stand outside the green door of the Customer Service Manager’s office.

Stand here on the eastbound District and Circle lines platform.

Get off the train at Victoria and follow the signs for the Victoria line (northbound). When you step onto the platform you’ll want to turn left and head all the way down to the far end, so you can board the train at the very rear. This puts you in the right position to make a quick exit at King’s Cross St Pancras, being sure – as always – to avoid that northern ticket hall!

Screenshot 2023-05-11 at 16.31.45
wicko3
http://johnwickerson.wordpress.com/?p=2254
Extensions
Hyphenation: when and why?
Uncategorized
When writing multi-word adjectives, hyphens can be important. But many writers use either too many or too few. This blog post is my attempt to clarify the situation (as I understand it). The basic issue is that “adjective noun noun” can be hard for readers to parse. The default is to understand it as (adjective… Continue reading Hyphenation: when and why? →
Show full content

When writing multi-word adjectives, hyphens can be important. But many writers use either too many or too few. This blog post is my attempt to clarify the situation (as I understand it).

The basic issue is that “adjective noun noun” can be hard for readers to parse. The default is to understand it as (adjective (noun noun)). For instance, “benign data race” would usually be read as “a data race that is benign” rather than “a race on benign data”.

So, if you want to indicate the other parsing, ((adjective noun) noun), you should add a hyphen. For instance, “shared-memory operation” is parsed as “an operation on shared memory” rather than “a memory operation that is shared”.

Note that “adverb adjective noun” doesn’t have this problem, because (adverb (adjective noun)) isn’t a valid parsing. For instance, a “weakly consistent system” has to be a “system that is weakly consistent” because a “consistent system that is weakly” doesn’t make grammatical sense. So there’s no need for hyphenation here.

Featured photo by Aaron Burden on Unsplash.

aaron-burden-y02jEX_B0O0-unsplash
wicko3
http://johnwickerson.wordpress.com/?p=2249
Extensions
Proving the existence of God using a computer
Computer Science
My friend and colleague George Constantinides wrote an interesting post on his blog recently where he formalises St Anselm’s argument for the existence of God. Since I’m teaching a course on the Isabelle proof assistant this term, I thought I’d see if I could recreate that argument using Isabelle. Here goes… As can be seen… Continue reading Proving the existence of God using a computer →
Show full content

My friend and colleague George Constantinides wrote an interesting post on his blog recently where he formalises St Anselm’s argument for the existence of God. Since I’m teaching a course on the Isabelle proof assistant this term, I thought I’d see if I could recreate that argument using Isabelle.

Here goes…

theory God imports Main begin

(* We assume a fixed set of "beings". *)
typedecl Being

theorem 
  (* Some beings exist in reality. The others
  exist only in the mind. *)
  fixes exists_in_reality :: "Being set"

  (* Suppose that the idea of "greatness" imposes
  a (strict) partial order on the set Being. That is, 
  some beings are greater than other beings. The 
  order is only partial, which means that it is 
  not always possible to compare the greatness of
  two beings. NB: strict partial order = 
  transitive + irreflexive. *)
  fixes greater_than :: "Being rel"
  assumes "trans greater_than"
  assumes "irrefl greater_than"

  (* We shall say that a being is godlike if
  there is no being that is greater than it. *)
  fixes godlike :: "Being set"
  assumes "∀b ∈ godlike. 
    ∄b'. (b',b) ∈ greater_than"

  (* The concept of god exists in the mind and 
  so is a member of Being. *)
  assumes "godlike ≠ {}"

  (* If two beings are identical except that one 
  exists in reality and one does not, then the 
  one that exists in reality is greater than the 
  one that doesn't. I've formalised a weaker 
  version: for every being that does not exist 
  in reality, there exists another being that 
  does exist in reality, and is greater. *)
  assumes "∀b. b ∉ exists_in_reality ⟶ 
    (∃b' ∈ exists_in_reality. 
      (b',b) ∈ greater_than)"

  (* God exists. *)
  shows "∃g ∈ godlike. g ∈ exists_in_reality"
  using assms by blast

end

As can be seen by the brevity of the final “using assms by blast” line, Isabelle didn’t have much trouble accepting this argument once all of the premises were laid out precisely. Then again, as for whether all of those premises are reasonable, I’m not so sure! I think St Anselm was taking some liberties in his original version, and I’m sure I’ve introduced some oversimplications in my own formulation.

Perhaps this is a theorem where a constructive proof would be more convincing…

1-gstQdHf1IV93WPoBi5d9nA-1200x559
wicko3
http://johnwickerson.wordpress.com/?p=2228
Extensions
“A Squash and a Squeeze” – Academic Edition
Academia
A little old lady worked all on her own On a paper about a result she had shown. A wise old man heard her grumble and rage: "There’s not enough room on my page! Wise old man, won’t you help me, please? My paper’s a squash and a squeeze."… Continue reading “A Squash and a Squeeze” – Academic Edition →
Show full content

With apologies to Julia Donaldson and Axel Scheffler

A little old lady worked all on her own
On a paper about a result she had shown.
A wise old man heard her grumble and rage:
“There’s not enough room on my page!
Wise old man, won’t you help me, please?
My paper’s a squash and a squeeze.”

“Bring in more proofs,” said the wise old man. 
“Bring in more proofs? What a curious plan.”
So she added a proof of each lemma in turn
But found that this led to increasing concern.
The little old lady cried, “God in his heaven!
My paper was ten pages; now it’s eleven!
Printing this out will use twenty-three trees!
My paper’s a squash and a squeeze! 
Wise old man, won’t you help me, please?
My paper’s a squash and a squeeze.”

“Bring in more graphs,” said the wise old man. 
“Bring in more graphs? What a curious plan.”
So she added some bar charts and one scatter plot
But found that this lengthened her work quite a lot.
The little old lady cried, “Though I have tried,
My paper’s now entered its seventeenth side!
Should I remove all the words that have e’s?
My paper’s a squash and a squeeze! 
Wise old man, won’t you help me, please?
My paper’s a squash and a squeeze.”

“Bring in more quotes,” said the wise old man. 
“Bring in more quotes? What a curious plan.”
So she added a quote and it made her work stronger,
But, lo and behold, her paper got longer!
The little old lady cried, “Please help me trim it,
I need to get down to my editor’s limit.
I added quotations and, well, quelle surprise:
My paper’s a squash and a squeeze! 
Wise old man, won’t you help me, please?
My paper’s a squash and a squeeze.”

“Take them all out,” said the wise old man.
“But then I’ll be back where I first began!”
So she made an appendix for all of her proofs
(Which anyway probably had a few goofs),
She chopped out the graphs and replaced them with tables
(She never did like all those x-axis labels),
She took out the quotes, and she had a good stare:
“I’m under the limit with four lines to spare!”
“Thank you, old man, for your canny suggestions!
Your strategy certainly raised a few questions,
But worked like a treat. So when I next edit
My paper I’ll give you appropriate 

Credit.”

Screenshot 2022-07-14 at 09.25.01
wicko3
http://johnwickerson.wordpress.com/?p=2214
Extensions
Some reflections on FCCM 2022
AcademiaFPGAs
FCCM 2022 in New York City has just drawn to a close, so I thought I’d put down some thoughts about my experience while it’s fresh in my mind. As my first in-person conference since covid, my expectations were sky high, and I’m very pleased to report that my expectations were well and truly met… Continue reading Some reflections on FCCM 2022 →
Show full content

FCCM 2022 in New York City has just drawn to a close, so I thought I’d put down some thoughts about my experience while it’s fresh in my mind. As my first in-person conference since covid, my expectations were sky high, and I’m very pleased to report that my expectations were well and truly met – I’ve had a fantastic time. It was really wonderful to re-connect with colleagues and collaborators from across the world, and to make new connections too. I really enjoyed the highly interactive “FLASHLIGHT” workshop on formal methods and high-level synthesis that I co-organised as a satellite event. The main conference had an interesting and varied programme, and it was so nice to be able to engage in direct conversation with the authors both in the Q&As after the talks and in the coffee breaks afterwards. Here is a handful of papers that have stuck in my mind:

  • Aman Arora presented a new design of FPGA in which block RAMs are augmented with their own built-in processing elements. The processing elements are not particularly exciting in themselves – for instance, they can only manage to process two or three bits at a time – but they have fantastically low latency because the data doesn’t need to travel across the FPGA fabric, from the RAM to the logic blocks and back again. An interesting challenge for the hardware compilers of the near future will be to work out which parts of the computation should be mapped to these low-latency, low-throughput “CoMeFa” RAMs, and which parts should remain in the traditional logic blocks.
  • Ecenur Üstün explained that there are several ways to decompose high-bitwidth integer multiplications into multiple lower-bitwidth operations, and presented her tool for exploring that design space using a technique called “equality saturation”.
  • Lana Josipović explained how to add resource sharing to her dynamically-scheduled HLS tool, in order to make the circuits it produces consume less area on the FPGA fabric. This is easy in the static-scheduling setting because you can see at compile-time that a given pair of operations are never executed simultaneously and hence can be time-multiplexed onto the same functional unit; in the dynamic-scheduling setting, this information doesn’t become apparent until run-time, which is too late!
  • Funnily enough, my student Michalis Pardalos‘s presentation was also about adding resource sharing to an HLS tool; on the one hand, his setting is easier than Lana’s because he assumes static scheduling, but on the other hand, his setting was more challenging because he is working on a computer-checked mathematical proof that the resource sharing is being carried out correctly.
  • Another of my students, Jianyi Cheng, showed how to extend HLS with a technique called C-slow pipelining, which can be thought of as a hardware pipeline with a cycle in it. An audience member expressed appreciation for Jianyi’s carefully crafted Powerpoint animations of these cyclic pipelines in action, which was heartening to see.
  • Nicholas Beckwith explained that HLS tools tend to generate designs with quite “spread out” memory access patterns; this is fine for ordinary RAM, but leads to strikingly poor performance when using persistent memory like Intel’s Optane DC, which accesses memory at the granularity of a “block” at a time. Nicholas showed how performance can be regained by re-mapping memory locations so as to minimise the number of blocks that need accessing.

FCCM 2022 was the first “hybrid” conference I’ve attended. Roughly half of its 275 participants were in-person and half were online. As for the talks themselves: about 15 were in-person and 10 were delivered live over Zoom. (One of the online talks was actually meant to be in-person, but the speaker self-quarantined upon arrival in NYC after coming down with flu-like symptoms.) There were just a handful of questions from remote participants throughout the whole conference, each of which was read out by the session chair.

It was notable that our venue (Cornell Tech) was extremely well equipped for a hybrid conference, with top-of-the-range sound systems, oodles of microphones of all shapes and sizes, lecterns with built-in Zoom support, and high-resolution cameras pointing at the speaker and at the audience. The system was fantastically complicated and required seemingly constant attention from the small army of hardworking volunteers. I suspect that a smaller-scale conference at a less modern venue would struggle.

The hybridisation was clearly an enormous effort for the organisers, but I was struck by how little “overhead” the hybridisation imposed on the in-person participants. We had to remember to speak into the microphone during the Q&As so that the remote participants could hear us, but other than that, there was barely any indication that this was a hybrid conference.

IMG_5740
wicko3
http://johnwickerson.wordpress.com/?p=2199
Extensions
The “Cluedo” effect in randomised testing
Computer Science
When playing the board game Cluedo (also known as Clue in North America), you make a series of hypotheses that consist of a murderer, a weapon, and a room. In each hypothesis, the room has to be the room your piece is currently in, and the player to your left always has the first opportunity… Continue reading The “Cluedo” effect in randomised testing →
Show full content

When playing the board game Cluedo (also known as Clue in North America), you make a series of hypotheses that consist of a murderer, a weapon, and a room. In each hypothesis, the room has to be the room your piece is currently in, and the player to your left always has the first opportunity to invalidate your hypothesis by showing you one of their cards.

So, it’s quite infuriating when you discover that the player to your left owns the card of the room you’re in, because once they’ve shown you that card, you’re not going to get any more information — either from them or from any other player — until you’ve moved your piece out of that room. That’s because even if you change the murderer or the weapon in your hypothesis, they’re (presumably) just going to keep showing you their room card.

The situation is much the same with randomised testing, or fuzzing.

Once you’ve found one bug in the software-under-test, it’s often the case that you’ll keep finding that same bug again and again, even with different random test-cases, especially if the bug is in a commonly used feature. This phenomenon, which has been written about by Hughes et al. among others, can mask any other bugs that might be present. In this situation, the analogue to “moving your Cluedo piece out of the room” is either: (1) identifying the buggy feature and making sure that subsequent test-cases avoid it, or (2) fixing the bug in the software-under-test before proceeding.

Thanks to David MacIver for pointing me to Hughes et al.’s paper.

Video game cassette, u2018Cluedo: The great detective gameu2019 (video game)
wicko3
http://johnwickerson.wordpress.com/?p=2189
Extensions
High-level synthesis, but correct
AcademiaComputer ScienceFPGAsProgramming Languages
This post is about a paper by Yann Herklotz, James Pollard, Nadesh Ramanathan, and myself that will be presented shortly at OOPSLA 2021. High-level synthesis (HLS) is an increasingly popular way to design hardware. It appeals to software developers because it allows them to exploit the performance and energy-efficiency of custom hardware without having to learn… Continue reading High-level synthesis, but correct →
Show full content

This post is about a paper by Yann Herklotz, James Pollard, Nadesh Ramanathan, and myself that will be presented shortly at OOPSLA 2021.

High-level synthesis (HLS) is an increasingly popular way to design hardware. It appeals to software developers because it allows them to exploit the performance and energy-efficiency of custom hardware without having to learn Verilog, and it appeals to hardware engineers because it offers them the convenient abstractions of high-level software languages like C.

However, there are concerns about its reliability. For instance, Andrew Canis, co-founder of LegUp Computing, has described the HLS process as “inherently prone to introducing bugs or regressions in the final circuit functionality”. Indeed, our own recent work gave credence to those concerns by uncovering several instances where mainstream HLS tools really do compile C programs into incorrect Verilog designs. This lack of reliability undermines the usefulness of HLS in safety-critical contexts.

To address this problem, we have built a new open-source HLS tool, called Vericert, that is mechanically proven correct. This means that whenever Vericert compiles a C program into a Verilog design, the Verilog design is guaranteed to behave in exactly the same way as the C program. To achieve this, we extended an existing verified C compiler called CompCert with a new hardware-specific intermediate language and a new Verilog backend. To phrase Vericert’s correctness theorem, we rely on a Verilog semantics developed by Lööw and Myreen, which we lightly extended to handle a few more language features that we needed. Vericert currently supports most C constructs, including integer operations, function calls (all inlined), local arrays, structs, unions, and general control-flow statements, but currently excludes support for case statements, function pointers, recursive function calls, non-32-bit integers, floats, and global variables.

We used the PolyBench/C benchmark suite to evaluate the performance of Vericert in comparison to an existing open-source HLS tool called LegUp. Area-wise: Vericert fares quite well, generating designs that are only around 10% larger than those generated by LegUp. Speed-wise: we initially found Vericert’s designs to run about 27x slower (!) than LegUp’s. This can mostly be blamed on Vericert’s naive strategy of cramming each operation into a single clock cycle, because whenever complicated operations like divisions are present, the clock has to run really slowly to give them time to complete. After replacing all the divisions in PolyBench/C with iterated shifts, we found Vericert’s designs to be about 2x slower than LegUp’s, which is much less objectionable. The remaining gap can be largely attributed to Vericert’s current lack of support for performing operations in parallel, and that’s a feature we’re working to add at the moment.

Of course, the main selling point of Vericert is its watertight guarantee of correctness, and this is something that we have evaluated empirically too. We used Csmith to generate thousands of random C programs and fed each one into Vericert. When we did this with other HLS tools, they tended to miscompile a few percent of the test programs, but we found that Vericert never produced an incorrect output (once we fixed a little bug in our pretty-printer, which is unverified). To be clear: Vericert quite frequently failed to produce any output at all, due to its lack of support for a few C language features that appeared in the test programs, but it never produced incorrect output.

Vericert is fully open-source and available on Github.

Screenshot 2021-08-27 at 11.25.25
wicko3
http://johnwickerson.wordpress.com/?p=2183
Extensions
What’s the Rush?
Uncategorized
Originally posted on Thinking:
At FPL 2021, my PhD student Jianyi Cheng (jointly supervised by John Wickerson) will present our short paper “Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining for HLS”. In this post, I explain the simple idea behind this paper and how it can significantly accelerate certain neglected corner…
Show full content

George Constantinides's avatarThinking

At FPL 2021, my PhD student Jianyi Cheng (jointly supervised by John Wickerson) will present our short paper “Exploiting the Correlation between Dependence Distance and Latency in Loop Pipelining for HLS”. In this post, I explain the simple idea behind this paper and how it can significantly accelerate certain neglected corner cases in high-level synthesis (HLS).

By far the most significant way to extract high performance from a hardware accelerator in high-level synthesis is to use loop pipelining. Loop pipelining is the idea of starting the next iteration of a loop before the previous one finishes, allowing multiple iterations to be executing simultaneously. However, some loop iterations may need a result produced by earlier loop iterations, limiting the extent to which this can be done. HLS tools generally determine a ‘safe’ initiation interval – the number of clock cycles between starting two adjacent loop iterations – and…

View original post 349 more words

wicko3
http://johnwickerson.wordpress.com/2021/08/17/whats-the-rush/
Extensions
Understanding the memory semantics of multi-threaded CPU/FPGA programs
AcademiaComputer ArchitectureComputer ScienceFPGAsProgramming Languages
If you’ve ever attended a seminar about weak memory models, chances are good that you’ve been shown a small concurrent program and asked to ponder what is allowed to happen if its threads are executed on two or three different cores of a multicore CPU. For instance, you might be shown this program: and asked… Continue reading Understanding the memory semantics of multi-threaded CPU/FPGA programs →
Show full content

If you’ve ever attended a seminar about weak memory models, chances are good that you’ve been shown a small concurrent program and asked to ponder what is allowed to happen if its threads are executed on two or three different cores of a multicore CPU. For instance, you might be shown this program:

// Thread 1:   ||  // Thread 2:
x := 1;        ||  y := 1;
r0 := y;       ||  r1 := x;

and asked what values can end up in the registers r0 and r1 if the program is executed on a multicore x86 machine, assuming that the shared locations x and y are both initially set to 0. The speaker might enjoy informing audience members unfamiliar with the intricacies of x86 memory semantics that the seemingly impossible outcome r0=r1=0 is actually allowed.

In a paper that Dan Iorga, Tyler Sorensen, Ally Donaldson and I will shortly present at OOPSLA 2021, we ask a slightly different question: what is allowed to happen if one of those threads is executed by a CPU core, and the other is simultaneously executed by a custom hardware design implemented on an FPGA?

That’s not quite as contrived a situation as you might think. In our era of heterogeneity, computing is increasingly being done by systems-on-chip that are made up of several different computing components (conventional CPUs, powerful GPUs, and programmable FPGAs), all with the potential to work together on a problem, each tackling the part that they are best suited to.

So what might happen if we run that program on a system-on-chip such as Intel’s recently-released Xeon+FPGA processor, which packs a multicore Xeon CPU and an Arria FPGA onto a single chip?

// Thread 1 (CPU):   ||  // Thread 2 (FPGA):
x := 1;              ||  y := 1;
r0 := y;             ||  r1 := x;

There are several reasons why weird behaviours, like the r0=r1=0 one we saw earlier, might still be allowed to happen. For one, the write buffers on the x86 CPU might cause the write of x to be delayed until after the read of y. For another, the FPGA might allow similar reordering; indeed, on the particular system-on-chip that we studied, memory access requests by the FPGA can be batched up and reordered before being dispatched. For a third, there are multiple channels that link the FPGA to the memory that it shares with the CPU, and memory requests in one channel might overtake those in another channel.

And there are other examples of concurrent programs where the CPU/FPGA combo can yield behaviours even more weird than those allowed on the CPU alone.

In our work, we’ve designed a mathematical model that explains which memory accesses by the CPU and by the FPGA can get reordered, and which ones can’t. We’ve validated our model by checking that it lines up with the available documentation from Intel, by talking it through with some Intel engineers who designed parts of the chip, and by generating and running hundreds of little concurrent programs as test-cases on a real machine (access to which was kindly provided to us by the Intel Labs Academic Compute Environment).

Our work shows how we can begin to reason about the correctness of small concurrent programs for these systems-on-chip. For instance, we use an encoding of our model in the CBMC model checker to confirm that a queue datastructure implemented between the CPU and the FPGA is implemented correctly. On the flip side, we’ve also conducted some experiments that show what can happen if this datastructure is implemented incorrectly – that is, we’ve shown that if some synchronisation operations are omitted, then the queue can indeed lose or duplicate some elements. (Which might not be the end of the world if you’re into approximate computing.)

Ultimately, we see this work as an important step towards providing a firm foundation for reasoning about programs that run on, and compilers that target, the hybrid computing devices that are fast becoming a central part of our computing landscape.

Screenshot 2021-07-02 at 17.09.47
wicko3
http://johnwickerson.wordpress.com/?p=2172
Extensions
Introducing C4: the C Compiler Concurrency Checker
Computer ScienceProgramming Languages
This post is about a paper by Matt Windsor, Ally Donaldson, and myself that will be presented shortly at the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) conference, in the tool demonstrations track. Compilers are a central pillar of our computing infrastructure, so it’s really important that they work correctly. There’s been… Continue reading Introducing C4: the C Compiler Concurrency Checker →
Show full content

This post is about a paper by Matt Windsor, Ally Donaldson, and myself that will be presented shortly at the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) conference, in the tool demonstrations track.

Compilers are a central pillar of our computing infrastructure, so it’s really important that they work correctly. There’s been a lot of work in the last decade or so on testing compilers by feeding them randomly generated programs and checking that they are compiled correctly. But most of this work has been limited to single-threaded (sequential) programs, which leaves open the question of whether compilers compile multi-threaded (concurrent) programs correctly.

There’s a good reason why existing work concentrates on the single-threaded case. The typical way to check that a randomly generated program has been compiled correctly is to compile it twice – whether that’s with two different compilers, or with two different sets of compiler flags, or before and after applying some semantics-preserving transformations – and see whether there’s any discrepancy between the two outputs. If there is, then one of the compilers must be doing something wrong! But this reasoning only works if every randomly generated program is deterministic – which multi-threaded programs seldom are.

So how can we expand compiler testing to the multi-threaded case? Our idea is to run the following sequence of steps in a loop:

  1. First, we generate a small multi-threaded program that is potentially tricky. By ‘small’ we mean around five or six instructions spread across two or three threads. And by ‘potentially tricky’ we mean that the program exercises a corner-case of the notoriously complicated C concurrency model. Lustig et al. showed in ASPLOS 2017 how to use the Alloy Analyzer to automatically generate suites of such programs directly from an axiomisation of the C concurrency model, so we adopt their technique here. Importantly, each program in this suite comes with a postcondition that characterises all the final states that the program is allowed to reach.
  2. Second, in the hope of making that small multi-threaded program more of a challenge for the compiler-under-test, we apply a series of semantics-preserving transformations to it, in the spirit of Lidbury et al.’s CLSmith tool, ending up with a large multi-threaded program. For instance, we might replace
      x := 42
    with
      if (getenv("john")) {
        x := 42
      } else {
        // lots of randomly
        // generated instructions

      }
    and then always run the generated code in an environment where the variable john has been set to 1. Actually we don’t need our transformations to be semantics-preserving in both directions; we just need to make sure that our transformations do not introduce new allowed outcomes (otherwise our tool might label correct behaviours as bugs). This means that we can also employ transformations like ‘randomly insert a fence’ or ‘randomly strengthen the consistency mode of an atomic operation from relaxed to release/acquire or from release/acquire to sequentially consistent‘. Importantly, because the small randomly generated program came with a postcondition, and because our transformations preserved that postcondition, we still know how our program should behave, even though it has now become rather large and unwieldy.
  3. Third, we compile our large multi-threaded program using the compiler-under-test, with randomly chosen command-line options. If the compiler crashes, then we have found a bug.
  4. Fourth, we execute the compiled program on a real machine. We execute the program thousands of times, in the presence of several ‘stressing’ threads that hammer on nearby memory locations in the hope of provoking weird behaviours. This hostile environment is created using the Litmus tool of Alglave et al. If the postcondition is violated on any of these runs, then we have found a bug.

This technique has been implemented in a tool called C4, which is available on GitHub. We applied C4 to several versions of four compilers (GCC, LLVM, the Intel C Compiler, and IBM XL) targetting three architectures (x86, IBM POWER9, and Arm8). C4 revealed two concurrency-related bugs in old versions of GCC, as well two previously-unknown bugs (which turned out, after test-case reduction, not to be related to concurrency) in IBM XL and GCC.

Of course, this is not as many bugs as we might have liked to have found! On the one hand, we suspect that if C4 employed more imaginative transformations, or if its testing throughput were optimised further, then it might be able to uncover more bugs, and this is a direction we’re pursuing. But on the other hand, we also have a growing suspicion that there may not be that many concurrency-related bugs in mainstream compilers in the first place. Perhaps compiler engineers make conservative choices when it comes to implementing concurrency, concerned by the inherent difficulties of testing multi-threaded code. Perhaps they do not shoot for super-aggressive optimisations like they do when compiling sequential code, and thus their code is less likely to contain the kind of faults that other compiler-testing tools have been so successful at exposing. Indeed, we experimented with injecting artificial bugs into LLVM, capturing the kinds of concurrency-related mistakes we’ve seen in other compilers, and found that C4 was effective at catching those. So, perhaps C4 will be end up being most valuable not for exposing existing bugs in compilers, but for giving compiler engineers the confidence to go ahead and implement more ambitious optimisations, safe in the knowledge that if they do introduce a bug, C4 will likely catch it.

Screenshot 2021-06-09 at 09.19.16
wicko3
http://johnwickerson.wordpress.com/?p=2162
Extensions
Fuzzing High-Level Synthesis Tools
AcademiaFPGAsProgramming Languages
High-level synthesis – the automatic compilation of a software program into a custom hardware design – is an increasingly important technology. It’s attractive because it lets software engineers harness the computational power and energy-efficiency of custom hardware devices such as FPGAs. It’s also attractive to hardware designers because it allows them to enter their designs… Continue reading Fuzzing High-Level Synthesis Tools →
Show full content

High-level synthesis – the automatic compilation of a software program into a custom hardware design – is an increasingly important technology. It’s attractive because it lets software engineers harness the computational power and energy-efficiency of custom hardware devices such as FPGAs. It’s also attractive to hardware designers because it allows them to enter their designs at a higher level of abstraction than Verilog or VHDL.

As such, high-level synthesis is being increasingly relied upon. But is it reliable? Yann Herklotz, Zewei Du, Nadesh Ramanathan and I have a paper next week at FCCM 2021 about our attempts to find out.

We follow a fairly well-trodden path. There have been several projects over the last decade or so that aim to test compilers by fuzzing – that is, by generating random source files, feeding each one into the compiler-under-test, and seeing whether it is compiled correctly. But these works have focused on conventional compilers that generate assembly code for processors. In our work, we apply a similar technique to HLS tools.

The ideas can be carried over without many changes – after all, HLS tools still take C programs as input, just like conventional compilers. We changed a few things: we restricted the random C generator to account for the fact that HLS tools don’t support the full C language, we added random HLS-specific directives (like: “pipeline this loop”) to the generated programs, and we tweaked the probabilities of the random generator to avoid generating programs that made the synthesis process unbearably long. To check that the HLS tool had compiled each random C program correctly, we executed its output using a Verilog simulator and compared that against the output from a executable that we compiled using GCC.

We generated 6700 random C programs and fed them to four widely used HLS tools: LegUp, Xilinx Vivado HLS, the Intel HLS Compiler, and Bambu. We found that all four tools could be made either to crash or to generate wrong code. In total, 1191 of our generated programs failed in at least one of the four tools, as shown in the picture below.

An Euler diagram showing which of our 6700 test-cases failed in which tools. Overlapping regions mean that the same test-cases failed in multiple tools.

We performed test-case reduction on some of the failures, and this resulted in the identification of 8 unique bugs, several of which have been confirmed by the HLS tool developers, and one of which (shown in the code snippet below) has already been fixed.

This program miscompiles in Bambu. Because the value of b is shifted to the right by 8, the output should be 0x100, but Bambu generates a design that returns 0. The increment operation on a appears unrelated, but is actually necessary to trigger the bug.

Finally, we compared how many bugs we could find in a few different versions of the same HLS tool. The diagram below shows the results of that experiment on three versions of Xilinx Vivado HLS. We see that the 2018.3 version of the tool has the most failures, though it is also striking that there are some test-cases that only fail in the most recent version.

This Sankey diagram tracks 3645 test-cases through three different versions of Xilinx Vivado HLS. The ribbons collect the test-cases that pass and fail together. The black bars are labelled with the total number of test-case failures per version. The test-cases that pass in all three versions are not shown.

In conclusion: this project has shown that compiler fuzzing techniques are effective at uncovering bugs in HLS tools. With HLS tools becoming increasingly relied upon, we hope that our work has motivated the need for these tools to be more rigorously engineered.

Screenshot 2021-05-07 at 15.05.57
wicko3
http://johnwickerson.wordpress.com/?p=2146
Extensions
A translation puzzle
Computer ScienceMathematics
Your task is to write down a sequence of English words that, after possibly moving around the spaces between the words, become their German translations. As stated, this is pretty easy: for instance, I could write down the single English word SAND which translates to the single German word SAND. Or I could be a… Continue reading A translation puzzle →
Show full content

Your task is to write down a sequence of English words that, after possibly moving around the spaces between the words, become their German translations.

As stated, this is pretty easy: for instance, I could write down the single English word SAND which translates to the single German word SAND. Or I could be a tiny bit more ambitious and write down a sequence of words like SAND PARK RADIO, which is similarly unchanged after translating into German.

But suppose we generalised away from just “English and German” and used a pair of languages that share an alphabet but don’t have any words that self-translate like that. Then the problem becomes rather harder because you might have to actually shuffle the spaces around in order to find a solution.

For instance, perhaps the dictionary looks like this:

Language 1Language 2CHEMOUSMOUSMOUSMOUSTAMOUSTACHECHETACHEMOUS

Can you come up with a sequence of words in Language 1 that, after possibly moving around the spaces, become their Language 2 translations? Here’s one:

"MOUS TA CHEMOUS MOUS MOUSTACHE"

is a sequence of words in Language 1, and if I move the spaces around I can get

"MOUSTA CHEMOUS MOUS MOUSTA CHE"

which is exactly the same sequence translated into Language 2.

Although that particular example wasn’t too hard in the end, it turns out that this is a rather hard problem to solve in general. In fact, it’s an instance of the Post Correspondence Problem, which is famously undecidable. That means that it’s impossible to write a program that is capable of taking any translation dictionary you can think of and telling you whether or not a solution exists.

Might still be fun to try, though.

s-l500
wicko3
http://johnwickerson.wordpress.com/?p=2114
Extensions
Diagrams for Composing Compilers
Computer ScienceProgramming Languages
T-diagrams (or tombstone diagrams) are widely used by teachers to explain how compilers and interpreters can be composed together to build and execute software. In this blog post, Paul Brunet and I revisit these diagrams, and show how they can be redesigned for better readability. We demonstrate how they can be applied to explain compiler… Continue reading Diagrams for Composing Compilers →
Show full content

T-diagrams (or tombstone diagrams) are widely used by teachers to explain how compilers and interpreters can be composed together to build and execute software. In this blog post, Paul Brunet and I revisit these diagrams, and show how they can be redesigned for better readability. We demonstrate how they can be applied to explain compiler concepts including bootstrapping and cross-compilation. We provide a semantics for our redesigned diagrams, based on binary trees. Finally, we suggest how our diagrams could be used to analyse the performance of a compilation system.

Introduction

In introductory courses on compilers across the globe, students are taught about the interactions between compilers using ‘tombstone diagrams’ or ‘T-diagrams’. In this formalism, a compiler is represented as a ‘T-piece’. A T-piece, as shown below, is characterised by three labels: the source language of the compiler, the target language of the compiler, and the language in which the compiler is implemented.

A compiler from source language ‘src’ to target language ‘tgt’ implemented in language
‘imp’.

Complex networks of compilers can be represented by composing these basic pieces in two ways. Horizontal composition means that the output of the first compiler is fed in as the input to the second. Diagonal composition means that the first compiler is itself compiled using the second compiler. The pictures below give examples of both forms.

Left: Horizontal composition of T-pieces. Haskell is compiled to C using a compiler written in Java, and thence to ARM assembly using another compiler that is also written in Java. Right: Diagonal composition of T-pieces. An OCaml-to-x86 compiler written in C is being compiled using a C-to-x86 compiler written in x86 assembly.

In this blog post, we investigate the foundations of these diagrams. What are the rules that govern how they can be composed? Can we redesign them to make them more understandable? What do they mean in general? And what do they tell us about the compilers they depict?

Background

The usefulness of diagrams for explaining the connectivity between compilers can be traced back at least as far as the UNCOL project in the 1950s, which was an early effort to provide a ‘Universal Computer-Oriented Language’ that could interface with many different front-ends and back-ends. Diagrams such as the one reproduced below were used to show how compilers producing UNCOL could be composed with those that consume it.

A precursor to the T-diagram. An UNCOL-to-704 compiler (top left) running on an IBM 704 machine (middle) is used to compile an OTN-to-UNCOL compiler implemented in UNCOL (bottom left), which results in an OTN-to-UNCOL compiler implemented in 704 machine code (right).

Shortly thereafter, T-diagrams were proposed by Bratman as an improvement on the UNCOL diagrams. As an example, the diagram below shows Bratman’s reimagining of the diagram above.

The first T-diagram. This T-diagram depicts the same information as the previous diagram.

Three problems with T-diagrams are immediately apparent.

  1. When two T-pieces are diagonally composed, they meet at two interfaces, and it is not clear that only one of these interfaces is required to have matching languages. Diagonal composition only requires the ‘implementation’ language of the left piece to match the ‘source’ language of the right piece.
  2. T-diagrams typically do not distinguish the operands to the composition from the result of the composition. For example, in Bratman’s picture above, the right T-piece is actually the result of composing the left T-piece with the middle T-piece, but this relationship is not made clear by the diagram – all three pieces appear on an equal footing. (Of course, the obvious way to avoid this problem is simply not to show the result of the composition in the same diagram.)
  3. The symmetric shape of the T-pieces invites a third mode of composition, pictured below, that is not actually meaningful. (Note that this form of composition does seem to appear in Bratman’s picture above, but this is only because the operands and the result of composition are being conflated.)
Only one of the two joining interfaces requires its languages to match (Problem 1).
This form of composition looks legal but is actually not (Problem 3)

Over the next couple of decades, several other diagrammatic systems were proposed, as pictured below. For example, Burkhardt proposed replacing Bratman’s T-pieces with X-shaped pieces, Sklansky et al. suggested D-shaped pieces, Rosin used pieces with a variety of shapes, and Earley and Sturgis extended Bratman’s T-pieces with I-pieces that represent interpreters. However, none of them satisfactorily resolved all three problems identified above.

A Burkhardt diagram of a compiler that accepts programs in language ‘N’, generates programs in language ‘SPS’, and is implemented in IBM 709 machine code.
A diagram by Sklansky et al. Step 1 shows an algorithm q written in language L (q/L) being compiled by an L-to-M compiler implemented in M (L/MM) on a machine M, to produce an implementation of q in M (q/MM). In Step 2, this implementation is executed on machine M with data δ, to produce the result q(δ).
A Rosin diagram. The input data is written in the ‘PI’ language; this is compiled to the ‘PO’ language by a compiler implemented in IBM 360 machine code. That compiler is running in a interpreter for 360 machine code that is implemented in 2065 machine code, which itself is running on an IBM 2065 machine.
An Earley–Sturgis diagram. The f function is written in Lisp, then translated to machine language (ML) by a compiler written in Lisp. That compiler is running inside a Lisp interpreter written in ML. The dashed line seems to be an attempt to clarify that the right-hand piece is the result of the composition (cf. Problem 2).
Our proposal: J-diagrams

We propose a redesign in which the basic piece has the shape of a backwards ‘J’.

This design makes it clear that there are two (and only two) modes of composition, as shown in the pictures below. Thus Problem 1 and Problem 3 are immediately addressed. To address Problem 2, we simply propose that J-pieces that are the result of the composition are not shown in the same diagram.

Left: Horizontal composition of J-pieces. Right: Diagonal composition of J-pieces.

We also propose two special cases of the J-piece that omit one interface. An I-piece omits the ‘target’ interface; it represents an interpreter, following Earley and Sturgis. A ‘tube’ omits the ‘implementation’ interface; it represents a compilation step that does not require an implementation, such as compiling Clight [4] to its superset, C. A ‘stopper’ omits both the ‘target’ and ‘implementation’ interface; it represents a machine that can directly execute programs in the ‘source’ language.

Left: An I-piece. Middle: A tube. Right: A stopper.

Next, we illustrate J-diagrams using several examples: Java compilation, verified C compilation using CompCert, a compiler testing framework, cross-compilation, and the history of the XPL programming language. In each case, we argue that the J-diagram is intuitive and clear, especially when compared to an equivalent T-diagram or ad-hoc diagram.

Java compilation as a J-diagram. A Java program is compiled to Java Bytecode using the javac compiler, which is written in Java. The JVM is an interpreter for Java Bytecode written in C++. The JVM itself is compiled using a C++ compiler such as gcc. Both javac and gcc must themselves be compiled somehow, but this diagram elects to leave those steps unspecified.
A J-diagram showing the high-level architecture of the CompCert compiler. The initial translation from C into the Clight sublanguage is not verified, but the rest of the compiler is written in Coq. The Coq tool is used to extract executable OCaml code.
A J-diagram showing how compilers written by students are assessed. Students submit a C-to-MIPS compiler implemented in C++, which is built using g++. The output of this compiler is tested by running it on the qemu MIPS emulator, which is implemented in C and built using gcc. In this diagram, we assume that g++ and gcc have both been compiled for an x86 machine.
Explaining cross-compilation using (top) an ad-hoc diagram from Wikipedia and (bottom) a J-diagram. The aim here is to use one machine (say, running Windows on an IA-32 processor) to build a compiler that can execute on another machine (say, running Mac OS on an x86_64 processor) and whose output that be executed on a third machine (say, running Android on an ARM processor). In both cases, the sequence is as follows. We first use the native Microsoft Visual C++ (msvc) compiler on the Windows machine to build the native g++ compiler for that machine. We then use this compiler to build a g++ compiler that targets the Mac. This compiler, in turn, is used to build the final compiler, which will execute on the Mac but will generate programs that execute on the Android device.
Explaining the history of the XPL language using (top) a T-diagram and (bottom) a J-diagram. The T-diagram, which is due to McKeeman, is hard to understand because of the large number of T-pieces and because of the ambiguity about which connections are meaningful. As a J-diagram, the history of the XPL language becomes much easier to read. The first XPL compiler (XCOM I) targeted IBM 360 machine code, and was written in ALGOL, which itself was compiled and executed on a Burroughs B5500 computer (far right). The second XPL compiler (XCOM II) was written in XPL, and was compiled using XCOM I, via a bootstrapping process. Similarly, the third version (XCOM III) was compiled using XCOM II. At the far left of the diagram, we see XPL being used to produce a compiler for a new user.
Interpreting J-diagrams

Our J-diagrams can be interpreted as vertex-labelled binary trees. As shown below, each J-piece represents a tree with three vertices, with each vertex labelled either with a language or with the distinguished ‘•’ symbol.

Interpreting J-pieces, I-pieces, tubes, and stoppers as vertex-labelled binary trees.

Composition of pieces corresponds to gluing trees together so that a leaf of the first tree has the same label as the root of the second tree. The picture provides an example of a tree obtained in this way.

The J-diagram from earlier in this blog post, interpreted as a tree.

We claim that interpreting a J-diagram as a tree in this way is natural. At the root of the tree is the ‘source’ language we wish to execute. At the leaves of the tree are all the languages that we need to be able to execute in order to be able to execute the language at the root of the tree. There is actually no particular need to distinguish between the ‘implementation’ language and the ‘target’ language – both are simply languages that we need to be able to execute. In other words, a compiler can be seen as a scheme for reducing the problem of executing programs in its ‘source’ language into two different problems:

  • the problem of executing programs in its ‘target’ language (the compiler’s output) and
  • the problem of executing programs in its ‘implementation’ language (the compiler itself). For example, the J-diagram above can be thought of as a method for executing C programs that depends upon a method for executing x86 programs.
Using J-diagrams to analyse compilers

We now suggest how J-diagrams could be used to aid the analysis and comparison of compilation strategies.

The picture below shows a graph of possible compilation strategies from LISP to JavaScript. This is the kind of graph generated by Akram’s tool for discovering all ways – whether direct or indirect – of compiling between any two given languages. Each vertex represents a language, and each edge represents a compiler from one language to another.

A graph of possible compilation strategies from LISP to JavaScript.

What is missing from this graph is information about how each compiler is implemented, and hence which further compilers may be needed in order to build them. To address this shortcoming, the picture below shows both possible compilation strategies as a pair of J-diagrams. By using this representation, it becomes clear that if we compile via Java Bytecode we shall also need the ability to execute Java code, and if we compile via LLVM IR we shall also need the ability to execute C++ code. Thus, more information is made available for the designer to make an informed choice.

Designing a compilation strategy from LISP to JavaScript with the help of J-diagrams. Left: LISP is compiled to Javascript via Java Bytecode using ABCL and TeaVM, both of which are implemented in Java. Right: LISP is compiled to Javascript via LLVM IR using Clasp and Emscripten, both implemented in C++.

If we have additional information about each J-piece beyond simply the languages involved, then we can use J-diagrams as an aid for reasoning about the performance of compilation strategies, as outlined next.

For any single J-piece, it is natural to ask:

  • How good is the generated output? Or, to coin a phrase, how well-targeted is the compiler? As a simple example, a compiler that inserts unnecessary sleep() instructions into its generated output is probably not well-targeted.
  • How long does it take to run? Or, to coin another phrase, how well-implemented is the compiler? As a simple example, a compiler whose source code includes unnecessary sleep() instructions is probably not well-implemented.

Then, for a J-diagram with source language S, it is natural to ask: if I use this compilation strategy to execute programs in language S, how quickly will I get the result?

A straightforward answer is that it depends upon how well-targeted and well-implemented all the compilers in the network are, since all of the compilers must be run before the final result can be obtained. For instance, in the J-diagram of the CompCert compiler above, in order to calculate the result of running the input C program, we need to run Coq to obtain an OCaml implementation of the verified compiler, then we need to run both the unverified compiler and the verified compiler. If the Coq compiler is well-targeted, then the verified compiler will be well-implemented.

A more nuanced answer involves distinguishing the time taken to obtain an executable (“compilation time”) from the time taken to run that executable (“running time”). In fact, there are several compilation times here: the time taken to compile the source program, the time taken to compile that compiler, the time taken to compile that compiler, and so on.

Traditionally, one is more concerned with improving the running time than the compilation time, since an executable can be compiled once and then run many times. In turn, the time taken to compile the executable tends to be more important than the time taken to compile the compiler, since a compiler can be compiled once and then used many times.

These various compilation times correspond to the different rows of a J-diagram. The running time depends upon how well-targeted the top row of J-pieces are, and how efficient is the machine or interpreter that runs this executable. The time taken to obtain this executable, on the other hand, depends upon how well-implemented the top row of J-pieces are, which in turn depends upon how well-targeted are the J-pieces (in the second row) that compile them. The time taken to obtain the compiler that obtains this executable depends upon how well-implemented the second row of J-pieces are, which in turn depends upon how well-targeted are the J-pieces in the third row, and so on.

For instance, the time taken to obtain the CompCert compiler depends upon how well-implemented the Coq compiler is. The time taken to use CompCert obtain an executable from a user-provided C program depends upon how well-implemented the unverified and verified compilers are, and how well-targeted the Coq compiler is. The running time of this executable depends upon how well-targeted the unverified and verified compilers are.

We can formalise this using the runningTime and compileTime functions defined below. As an example, we can see by unfolding those definitions that the ‘compile time’ for the J-diagram about student compilers depends upon the wellimplementedness of the student’s compiler and the welltargetedness of g++, and that the ‘running time’ for that J-diagram depends upon the welltargetedness of the student’s compiler, the wellimplementedness of qemu, and the welltargetedness of gcc, all of which is as expected.

How the running time and compile time of a program depend upon the well- implementedness and well-targetedness of the various compilers involved. Following a Prolog-like notation, we write ‘:−’ to mean ‘depends upon’.
Conclusion

We have investigated the foundations of diagrams that describe how compilers can be composed. To conclude, let us return to the four questions posed in the introduction.

  • First, we explained that existing graphical systems, chiefly T-diagrams, are unsatisfactory because the rules about how compilers can be composed are not intuitive; for instance, one form of diagonal composition is legal but the symmetric one is not.
  • Second, we presented a new visual language, based on ‘backwards J’-shaped pieces, that addressed the shortcomings of previous systems.
  • Third, in answer to the question of what these diagrams ‘mean’, we showed how they can be interpreted quite straightforwardly as binary trees. A J-diagram can thus be thought of as a strategy for problem-reduction: it describes how the problem of executing programs in the language at its root vertex can be reduced to the problems of executing programs in all the languages at its leaf vertices.
  • Fourth, in answer to the question of what these diagrams tell us about the compilers they depict, we suggested how they could be a useful aid when comparing different compilation strategies, or when analysing which parts of the compilation process are on the ‘critical path’ regarding compiling time or running time.

Ultimately, we expect J-diagrams will prove most valuable in teaching settings, because they are – we believe – an improvement upon the T-diagrams that are already in widespread use.

The authors would like to thank Diana Costa, Alastair Donaldson, Roser Pujadas, Will Venters, and Matt Windsor for helpful discussions.

screenshot-2020-05-20-at-11.58.55
wicko3
http://johnwickerson.wordpress.com/?p=2078
Extensions
Uncovering multicore interference
Uncategorized
Originally posted on Dan's Blog:
If I visit an exotic destination, I would surely do some sightseeing. I would cycle to that destination and I would use Google Maps for navigation. At most crossroads, I need to know in which direction I should turn. Other apps running on my phone might delay the indications…
Show full content

Dan Iorga's avatarDan's Blog

If I visit an exotic destination, I would surely do some sightseeing. I would cycle to that destination and I would use Google Maps for navigation. At most crossroads, I need to know in which direction I should turn. Other apps running on my phone might delay the indications provided by Google Maps and therefore I can mistakenly make a wrong turn and possibly be eaten by some large animal.

The reason why this is happening is that on all modern multicore processors, applications on different cores of the system are competing for the shared resources. This is not a problem in most applications but it can be in time-sensitive ones. If we design any such time-sensitive programs, we need to know what to expect in terms of interference so that we can accommodate for the delays. The most common approach for evaluating delays in multicore systems is to run…

View original post 406 more words

wicko3
http://johnwickerson.wordpress.com/2020/04/20/uncovering-multicore-interference/
Extensions
Highlights from FPGA 2020
AcademiaFPGAs
Here are a few personal highlights from the FPGA 2020 conference, which took place this week in Seaside, California. (Main photo credit: George Constantinides.) Jakub Szefer‘s invited talk on “Thermal and Voltage Side Channels and Attacks in Cloud FPGAs” described a rather nifty side-channel through which secrets could be leaked in the context of cloud-based… Continue reading Highlights from FPGA 2020 →
Show full content

Here are a few personal highlights from the FPGA 2020 conference, which took place this week in Seaside, California. (Main photo credit: George Constantinides.)

Jakub Szefer‘s invited talk on “Thermal and Voltage Side Channels and Attacks in Cloud FPGAs” described a rather nifty side-channel through which secrets could be leaked in the context of cloud-based FPGAs. The context is that one user of an FPGA would like to transmit information secretly to the next user of the same FPGA. Jakub suggests transmitting this information via the physical temperature of the FPGA. His proposal is that if the first user would like to transmit a ‘1’, they heat up the FPGA by running a circuit that uses a lot of current, such as a ring oscillator; if they would like to transmit a ‘0’, they don’t. The second user, once they arrive, measures the temperature of the FPGA. Jakub showed that even if there is a reasonable gap between the first user finishing and the second user starting, the measurement can still be taken fairly reliably because the heat is dissipated quite slowly. Follow-on work asks whether it is possible to transmit more information by just heating up (and then measuring the temperature of) parts of the FPGA.

Stefan Nikolic‘s paper on “Straight to the Point: Intra- and Inter-Cluster LUT Connections to Mitigate the Delay of Programmable Routing” (with Grace Zgheib and Paolo Ienne) struck me as a nice example of ‘out of the box’ thinking. FPGAs are designed to allow any two LUTs to be connected to one another, but because of their rigid structure, the route between them may be a little slow and circuitous. Stefan’s paper asks what would happen if we could just stick a direct wire between certain pairs of LUTs, like you can when designing fully-custom hardware. It’s a hypothetical question, of course, because FPGAs don’t actually allow this. But Stefan’s experiments using a simulator suggest that some significant performance improvements could be obtained if they did, so perhaps the FPGA manufacturers should bear this in mind for the future!

Philip Leong‘s presentation on “LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations” (on joint work with Seyedramin Rasoulinezhad, Siddhartha, Hao Zhou, Lingli Wang, and David Boland) was motivated by the observation that in several important applications, a lot of the LUTs on an FPGA are employed to implement XOR operations. His proposal is to modify the design of the FPGA itself to add dedicated XOR gates for handling these operations. The idea is that dedicated XOR gates can implement XOR operations rather more efficiently than general-purpose LUTs can. Moreover, by moving XOR operations out of the LUTs, those LUTs can be more usefully employed for other things. I found it rather interesting to note the connection between the work Philip described and some previous work from Imperial on LUTNet. Both take as their starting point the observation that a lot of LUTs are being used to implement XOR operations. But where Philip’s paper suggests moving these XORs to their own hardware, LUTNet suggests replacing the XOR operations with different ones that can still be implemented on a LUT but may be more suitable for that particular design, thus exploiting the flexibility of LUTs. Perhaps future work will see both solutions combined together.

Vinod Kathail from Xilinx gave a keynote about the new Vitis high-level synthesis (HLS) framework. Vitis HLS seems quite appealing to me, for two reasons. One is that it seems to expose rather more of the internal stages of the HLS process than Xilinx’s previous tools, which should make it easier for us to prototype research ideas for HLS. The other is that Vitis HLS promises not to need the plethora of ‘pragma’ annotations that current HLS tools expect programmers to provide. If it is really the case that Vitis HLS can manage to produce performant hardware without relying on these hints, then we may be finally heading towards a world where software programmers really can produce hardware designs without needing a degree in electrical engineering!

Sameh Attia‘s paper on “StateMover” (with Vaughn Betz) described a fascinating system for debugging hardware that sits halfway between simulating it and executing it. The advantage of simulation is its flexibility: the designer can inspect (or even change) the contents of every component, and they can run the clock forwards or backwards until they pinpoint the problem in their design. But the disadvantage of simulation is its slowness: simulating a design can take about 100000 times longer than executing it properly. Sameh’s work describes a way to move the entire state of a design back and forth between an FPGA and a software simulator. His proposal is that a design starts running on an FPGA, then at some predetermined trigger point, the clock is stopped and the current state of the FPGA is copied over to the simulator. The designer can then investigate what is going on at this point, and maybe step the clock forwards or backwards a bit, and then they can copy the new state back to the FPGA and carry on running it. The best of both worlds is thus achieved: the fiddly parts of the design can be run in the simulator, and the long, uninteresting parts can be left to the hardware. Myself, I wondered whether this system could form the basis of an exhaustive hardware simulator: one that runs a design in hardware until comes to some sort of ‘branch’ or ‘decision point’, at which point it saves its current state, and executes one of the paths. Later it restores this saved state and executes the other path. Thus, it might be feasible to explore all the possible behaviours of a system.

Lana Josipovic‘s Best-Paper-award-winning paper on “Buffer Placement and Sizing for High-Performance Dataflow Circuits” (with Shabnam Sheikhha, Andrea Guerrieri, Paolo Ienne, and Jordi Cortadella) was concerned with improving the performance of dynamically scheduled circuits, which can be thought of as a little like Petri nets. Inserting buffers into these circuits cannot change their functional behaviour, because every component executes when (and only when) all of its inputs are ready; adding buffers may mean that these inputs become ready a little later, but it cannot change their values. Nonetheless, there are a few reasons for inserting buffers into a circuit: to break up the critical path so as to increase the operating frequency, to break a combinational cycle, or to balance the latency of the tines in a control flow fork. Lana uses linear programming to work out when to place buffers in a circuit, where to place them, and how big to make each one; the result is an HLS tool that can generate considerably faster circuits.

Han Chen‘s paper on “FPGA-Accelerated Samplesort For Large Data Sets” (with Sergey Madaminov, Michael Ferdman, and Peter Milder) addresses one of the most long-studied problems in computer science: how can we sort data quickly? They describe a hardware-accelerated implementation of samplesort, which is a pleasingly simple sorting algorithm that I had not come across before. Samplesort is a little like mergesort, in that it involves partitioning the dataset into lots of parts that are all sorted independently. But where mergesort splits the dataset blindly, sorts each part, and finally merges the sorted subsets back together, samplesort first divides the dataset into several buckets (with each bucket covering a specific range of values), then sorts each bucket, and finally concatenates the buckets together. So both algorithms have an embarrassingly-parallel middle step, but mergesort has its complexity in the third step and samplesort has its complexity in the first step. Of course, the tricky part of samplesort is working out what ranges to use for each bucket so that each one ends up containing roughly the same number of data items. This is addressed by having a extra preprocessing stage, in which the data is sampled (hence the name of the algorithm) to estimate the appropriate range for each bucket. There are all sorts of interesting trade-offs in Han’s design. How many buckets should there be? (Presumably this depends on how much many buckets it is possible to sort in parallel.) How long should we spend sampling? (Time spent sampling is time not spent sorting, but if the sampling is inaccurate then the subsequent sorting will not be efficient.) How much contingency space should there be in each bucket? (If the buckets are too small, they may spill over, but if they are unnecessarily big, then hardware resources are being wasted.)

Finally, there were also two papers co-authored by me. Yann Herklotz‘s paper on “Finding and Understanding Bugs in FPGA Synthesis Tools” described his efforts to find bugs in tools like Quartus, Vivado, and Yosys; his presentation was greeted with an amusing mixture of alarm and pride from the developers of these tools, depending on the respective number of bugs he found in each! And Jianyi Cheng‘s paper on “Combining Dynamic and Static Scheduling in High-level Synthesis” (also with Lana Josipovic, George Constantinides, and Paolo Ienne) described his quest for a scheduling approach that combines the advantages of static (i.e. compile-time) scheduling and dynamic (i.e. run-time) scheduling. Both Yann and Jianyi did a great job with their presentations, making me a very proud supervisor!

imperial_fpga
wicko3
http://johnwickerson.wordpress.com/?p=2070
Extensions
New names for odd and even functions
Mathematics
The concept of “odd” and “even” functions is quite important in areas such as Fourier analysis. An odd function is one that has 2-fold rotational symmetry around the origin; examples include , , and . An even function is one that has mirror symmetry around the y-axis; examples include and . The names “odd” and… Continue reading New names for odd and even functions →
Show full content

The concept of “odd” and “even” functions is quite important in areas such as Fourier analysis. An odd function is one that has 2-fold rotational symmetry around the origin; examples include y=x, y=x^3, and y=\sin x.

An even function is one that has mirror symmetry around the y-axis; examples include y=x^2 and y=\cos x.

The names “odd” and “even” come from the exponents in the examples above. For instance, y=x^3 is called an odd function because 3 is an odd number.

However, “odd” and “even” are actually not very good names for these functions. The thing is, there are several useful identities related to these functions, such as

whenever you add an odd function to another odd function you get an odd function

and

whenever you multiply an odd function by an even function you get an odd function

and almost none of these line up with the well-known identities for odd and even numbers. In mathematics, the whole point of using the same name for two different concepts is to tell the reader that much of the intuition from one concept can be carried over to the other. But this does not happen here. When you add an odd number to another odd number you get an even number, not an odd one. When you multiply an odd number by an even number you get an even number, not an odd one. Moreover, some functions are both odd and even (namely, the y=0 function), but no numbers are both odd and even. The analogy between odd/even functions and odd/even numbers simply does not work at all.

Odd/even functionsOdd/even numbersodd + odd = oddodd + odd = eveneven + odd = could be eithereven + odd = oddeven + even = eveneven + even = evenodd × odd = evenodd × odd = oddeven × odd = oddeven × odd = eveneven × even = eveneven × even = even

It would be better, I think, if odd functions were renamed as “negative” functions, and even functions were renamed as “positive” functions. These names aim to build an analogy to negative and positive numbers, and this time the analogy actually works. When you add a negative number to another negative number you get a negative number. When you multiply a negative number by a positive number you get a negative number. And the y=0 function corresponds naturally to the number 0 which is both positive and negative (or neither, if you prefer).

Negative/positive functionsNegative/positive numbersneg + neg = negneg + neg = negpos + neg = could be eitherpos + neg = could be eitherpos + pos = pospos + pos = pospos + pos = pospos + pos = pospos + pos = pospos + pos = pospos × pos = pospos × pos = pos

Of course, there are disadvantages to calling these functions “negative” and “positive”. For instance, calling a function “positive” might give the misleading impression that it lies entirely above the y=0 line. Still, no names are perfect, and I reckon that this disadvantage notwithstanding, my proposed names are better than the current ones.

Screenshot 2020-02-19 at 16.00.03
wicko3
http://johnwickerson.wordpress.com/?p=2058
Extensions
Fuzzing FPGA synthesis tools
AcademiaFPGAs
When you want to do some computation on an FPGA, it is traditional to enter your design in a language like Verilog, and then to use automatic synthesis tools to turn your Verilog design into a “configuration bitstream” that can be fed to your FPGA to make it perform the computation you want. These synthesis… Continue reading Fuzzing FPGA synthesis tools →
Show full content

When you want to do some computation on an FPGA, it is traditional to enter your design in a language like Verilog, and then to use automatic synthesis tools to turn your Verilog design into a “configuration bitstream” that can be fed to your FPGA to make it perform the computation you want.

These synthesis tools occupy an extremely privileged position. They are, after all, responsible for almost all of the world’s FPGA-based computation. There is no feasible way to compute with an FPGA other than via these tools. They also occupy an extremely trusted position. We have to trust that they faithfully reproduce the computation we specified in Verilog, because we have no completely reliable way to check whether they have made a mistake. This is especially the case given that the details of how FPGAs are configured are generally not made public.

In our FPGA 2020 paper, my student Yann Herklotz and I ask whether these tools really deserve our trust.

To answer this, we set about doing some fuzzing. We were inspired by a tool called Csmith, whose sole purpose is to generate large, random C programs. These C programs don’t do anything in particular, but they are a nightmare for compiler-writers because they combine C language features in weird ways that humans wouldn’t ordinarily think of. Programs generated by Csmith have exposed hundreds of bugs in widely-used C compilers. We wanted to see if we could do the same thing for FPGA synthesis tools.

To this end, we produced a tool called Verismith that generates large, random Verilog programs. Like Csmith, we have to take care that the programs we generate are “legal”; after all, if you put bogus Verilog into a synthesis tool, you can’t expect sensible output. This means making sure that we don’t do things like drive a wire from two different sources, pass the wrong number of bits to a module, or divide by zero.

We used Verismith to generate about fifty thousand Verilog programs, and we gave each of these to four different FPGA synthesis tools: Xilinx XST, Xilinx Vivado, Intel Quartus, and Yosys. The output of these tools is a netlist – a list of interconnected low-level logic gates. (This netlist can then be turned into a configuration bitstream by other automatic tools that we have not tested.) To check whether the synthesis tool has worked correctly, we ask a SAT solver if there are any discrepancies between the input Verilog with the output netlist. If there are, then we have found a bug!

Each time we find a bug, we use Verismith to reduce it so that it is as small (and hence understandable) as possible. To do this, Verismith repeatedly chops the Verilog program in half, each time discarding the half that does not trigger the bug, until it cannot make it any smaller.

After reducing all the buggy programs to be as small as possible, we decided that some of these programs are really exposing the same bug. Ultimately, we found and reported 4 bugs in Yosys, 5 in Xilinx Vivado, and 1 in a Verilog simulator called Icarus. All of these bugs were confirmed by the vendors. (We didn’t report the bugs we found in Xilinx XST as this tool is no longer supported.)

Our programs trigger two different kinds of bugs in the synthesis tools. Some of them trigger “crash” bugs, which means that the synthesis tool goes wrong even though the input is perfectly valid. Others trigger “mis-synthesis” bugs, which means that the synthesis tool produces a netlist that is not equivalent to the input Verilog. In our paper, we dig into a handful of the bugs that we found, to explain exactly what the tools are doing wrong.

It’s hard to say for sure how much these bugs really matter. If they only come to light when the tools are given fairly baroque Verilog that no human would realistically write, and no Verilog-targeting compiler would automatically generate, then perhaps we need not be too concerned. That said, one observer on the Xilinx bugtracking forums did comment that one of the bugs we found “looks […] to be a rather critical bug”. Personally, I think any bug in a synthesis tool is enough to undermine its trustworthiness, whether or not it is likely to affect the computation that I want to do on my FPGA.

Screenshot 2019-11-22 at 14.52.12
wicko3
http://johnwickerson.wordpress.com/?p=1942
Extensions
A plot twist! Drawing better graphs in PL papers
Uncategorized
In this post, I’d like to share some thoughts I’ve accumulated over the past few years about how to draw better graphs. To get straight to the point, I have two concrete recommendations: normalised data should usually be plotted on a logarithmic scale, and scatter plots can be easier to understand than bar charts. I’ll… Continue reading A plot twist! Drawing better graphs in PL papers →
Show full content

In this post, I’d like to share some thoughts I’ve accumulated over the past few years about how to draw better graphs.

To get straight to the point, I have two concrete recommendations:

  • normalised data should usually be plotted on a logarithmic scale, and
  • scatter plots can be easier to understand than bar charts.

I’ll now elaborate on both of these points, drawing upon examples from 31 examples of graphs I found in the proceedings of PLDI 2019.

Logarithmic scales for normalised data

I believe that normalised data should be plotted on a logarithmic scale. By “normalised data”, I mean data that is the ratio between two measurements that have the same dimension. For example: the ratio between the execution time of a program before a proposed compiler optimisation has been applied and the execution time of that program afterwards.

I will illustrate my reasons with reference to the two graphs below, which both show some sort of “speedup” that has been obtained on four benchmark programs, A, B, C, and D. The left graph uses a linear scale on the y-axis, while the right one plots the same data on a logarithmic scale.

There are four reasons why the logarithmic scale is better:

  1. The natural origin for a speedup ratio is 1, not 0. That is, we are primarily interested in seeing whether a data point lies above 1 (which indicates a speedup) or below 1 (which indicates a slowdown). This fits nicely with logarithmic scales, which can’t go down to 0.  In the right-hand graph above, it is immediately obvious that A and B experience a slowdown; this is slightly less obvious in the left-hand graph.
  2. Going from a 1x speedup to a 2x speedup is surely more impressive than going from a 3x speedup to a 4x speedup. But on the linear y-axis in the left-hand graph above, the distance between 1 and 2 is the same as the distance between 3 and 4, so these feats would appear equally impressive.
  3. Often, it is just as good to get a 2x speedup as it is bad to get a 2x slowdown. But on the linear y-axis in the left-hand graph above, the distance from 1 to 0.5 is much smaller than the distance from 1 to 2, so the speedup experienced by benchmark C is emphasised over the slowdown experienced by benchmark B, even though both have the same magnitude.
  4. On a linear scale, the “centre of gravity” to which the eye is drawn lies at the arithmetic mean, while on a logarithmic scale, the centre of gravity is at the geometric mean. When averaging dimensionless ratios, most authors tend to use the geometric mean.

One caveat: the logarithmic scale doesn’t work well if the normalised data can be zero. This is quite rare though – I don’t think I saw this in any of the PLDI 2019 papers I scanned through when writing this post.

Here are some examples of graphs from PLDI 2019 that I believe would be improved by using a logarithmic scale on their y-axes.

This first graph demonstrates why y=0 is not the best origin – if the bars started at y=1 (and then stretched upward for speedups or downwards for slowdowns) then there would be no need for the thick black line at y=1.
This graph would also be clearer if its bars emanated from y=1. Moreover, this would serve to remove the unnecessary orange bars, which represent the values against which the other bars are normalised.
These two examples have the unfortunate property that some bars extend beyond the top of the scale. Another advantage of the logarithmic scale is that it handles unusually large (and unusually small) values gracefully.
This example is particularly unusual. It has the nice property of treating speedups and slowdowns symmetrically, but has a disconcerting discontinuity between y=-1 and y=1, and requires the reader to grapple with strange concepts like negative speedup factors! (From reading the surrounding text, I understand that a speedup factor of -2 should be interpreted as a 0.5x speedup, or a 2x slowdown.)
I’m a little torn about these two examples, both of which plot speedup against the degree of parallelism. In both cases, the advantage of plotting the speedup on a linear scale is that we can compare against the ideal situation of “speedup factor equals parallelism degree”, which is depicted as a straight line at y=x. Perhaps a reasonable compromise in these situations would be to use logarithmic scales for both axes.
Here is an assortment of other graphs from PLDI 2019 that plot normalised values on a linear scale.
I’d like to give an “honourable mention” to these two graphs, which do use logarithmic scales, but could be improved further by starting the bars at y=1 rather than at an arbitrary small number like y=0.01.
In praise of scatter plots

The problem with bar charts is that the x-axis tends to be underemployed. It has the capacity to represent some sort of quantity, but often is used simply to spread out a collection of benchmarks in some arbitrary order. I think that information can often be conveyed more effectively to the reader by using a scatter plot instead.

As an example, here is a classic bar chart, showing the performance of a “new” technique compared to an “old” technique over a range of benchmarks.

The graph is rather straightforward to read: when the green bar is higher, the old technique is faster, and when the red bar is higher, the new technique is faster. This graph provides all the information needed to compare the two techniques. But it is not easy to see “at a glance” which technique is better.

Here is an analogous scatter plot.

I reckon that scatter plots take longer to read than bar charts, but less time to understand. What I mean is: it’s immediately obvious from the bar chart that the height of a bar represents the time taken for that benchmark, but it takes a few moments to work out that in the scatter plot, the points above the diagonal represent benchmarks where the old technique is faster, and those below the diagonal represent benchmarks where the new technique is faster.

However, once this has been established, it becomes straightforward to compare the two techniques. One can immediately make observations like “the new technique seems to win on the shorter-running benchmarks, but to lose on the longer ones”.

It’s worth pointing out that my scatter plot shows less information than my bar chart, because it does not identify the individual benchmarks. Perhaps, if there are not too many benchmarks, it would be possible to label the points individually, or to use a different colour for each point. Of course, this risks overcomplicating things, and we are often more concerned with general trends than with the performance of particular benchmarks. A reasonable compromise might be to colour a handful of the particularly interesting points so that they can be referred to in the surrounding text.

By the way, I have used semi-transparent markers in my scatter plot. I find this quite an attractive way to deal with multiple points being almost or exactly on top of each other. With opaque markers, coincident points could get lost.

Scatter plots can cope with more adventurous situations too. For instance, here is a scatter plot that compares two variants of the “new” technique against the “old” one.

And here’s a scatter plot that conveys the uncertainty surrounding each data point using ellipses. The width of the ellipse corresponds to the uncertainty in the x-value, and the height of the ellipse corresponds to the uncertainty in the y-value. Ellipses that cross the y=x diagonal represent benchmarks where we’re not sure which technique is better.

Here are some examples of graphs from PLDI 2019 that might be better as scatter plots.

These three graphs all use a logarithmic scale for normalised data, thus meeting the first criterion in this post, but all have under-employed x-axes that are only being used to spread out benchmarks in an arbitrary order.
This example is quite interesting – here, the benchmarks are not in an arbitrary order along the x-axis; they are in descending order of their y-values. However, I think showing “time taken by str.KLEE” against “time taken by vanilla.KLEE” on a scatter plot would be more informative.
These four examples all use linear scales, but this is fine because they’re not plotting normalised values.
These four examples all use scatter plots, and appropriate scales. My only suggestion here is that they could all benefit from semi-transparent data points, so that overlapping points don’t get lost.
This final example is a scatter plot, with appropriate scales, and with no need for semi-transparent data points. There’s not much to fault here; my only suggestion is to use exactly the same scales for the x-axis and the y-axis, so that the y=x diagonal crosses from the bottom-left corner to the top-right corner.
Further reading
  • LaTeX code for the graphs drawn by me is available.
  • A 1983 article “On graphing rate ratios” in the American Journal of Epidemiology argues that relative rates should be plotted on logarithmic rather than linear scales. A counterpoint is provided by James R. Hebert and Donald R. Miller’s 1989 article “Plotting and discussion of rate ratios and relative risk estimates” in the Journal of Clinical Epidemiology, which argues that relative rates should actually be plotted not on logarithmic scales but on reciprocal scales!
  • The SIGPLAN Empirical Evaluation Checklist suggests logarithmic scales for speedups, among several other recommendations for drawing good graphs.
Screen Shot 2016-08-06 at 20.09.15
wicko3
http://johnwickerson.wordpress.com/?p=2032
Extensions
In praise of scatter plots
Computer ScienceMathematicsUncategorized
A lot of papers include a graph that benchmarks the performance of a new technique against a technique from previous work. Such a graph might look like this: The graph is rather straightforward to read: when the green bar is higher, the old technique is faster, and when the red bar is higher, the new… Continue reading In praise of scatter plots →
Show full content

A lot of papers include a graph that benchmarks the performance of a new technique against a technique from previous work.

Such a graph might look like this:

Screenshot 2019-10-22 at 09.48.45.png

The graph is rather straightforward to read: when the green bar is higher, the old technique is faster, and when the red bar is higher, the new technique is faster. This graph provides all the information needed to compare the two techniques. However, it is not easy to see “at a glance” which technique is better.

The fundamental problem, as I see it, is that the x-axis is being underutilised. Where a bar is placed along the x-axis does not matter. As long as the bars are spaced out enough, we do not mind where they sit. We could put the benchmarks in any order, and the graph would be fundamentally the same. This is in stark contrast to the y-axis: the position of the top of each bar on the y-axis is very important: it relates directly to the time taken to run that benchmark.

Enter the scatter plot

We can make better use of the x-axis by drawing a scatter plot:

Screenshot 2019-10-22 at 10.01.47.png

I reckon that scatter plots take longer to read than bar charts, but less time to understand. What I mean is: it’s immediately obvious from the bar chart that the height of a bar represents the time taken for that benchmark, but it takes a few moments to work out that in the scatter plot, the points above the diagonal represent benchmarks where the old technique is faster, and those below the diagonal represent benchmarks where the new technique is faster.

However, once this has been established, it becomes straightforward to compare the two techniques. One can immediately make observations like “the new technique seems to win on the shorter-running benchmarks, but to lose on the longer ones”.

It’s worth pointing out that my scatter plot shows less information than my bar chart, because it does not identify the individual benchmarks. Perhaps, if there are not too many benchmarks, it would be possible to label the points individually, or to use a different colour for each point. Of course, this risks overcomplicating things, and we are often more concerned with general trends than with the performance of particular benchmarks. A reasonable compromise might be to colour a handful of the particularly interesting points so that they can be referred to in the surrounding text.

By the way, I have used semi-transparent markers in my scatter plot. I find this quite an attractive way to deal with multiple points being almost or exactly on top of each other. With opaque markers, coincident points could get lost.

Benchmarking three or more techniques

A lot of papers do not have such a straightforward comparison between one “old technique” and one “new technique”. There may be several “old techniques” with different strengths that are worth comparing against. And the paper may be proposing multiple “new techniques” that all need evaluating.

To represent this kind of situation, here is a bar chart that compares two new techniques against one old technique.

Screenshot 2019-10-22 at 10.09.57.png

The problems of the first bar chart have been compounded. It is now even more difficult to work out which of the three techniques is best overall.

Happily, scatter plots can come to our rescue again.

Screenshot 2019-10-22 at 10.16.37.png

Here, we compare the old technique against both the first new technique (green circles) and the second new technique (blue squares).

As for comparing the two new techniques, this is not completely straightforward – one can observe that the green circles extend both a bit above and a bit below the blue squares, but little else.

Plotting relative values

Often, we do not care too much about the absolute time taken for each benchmark, only the relative time taken compared to the old technique. So, one quite commonly sees graphs like this one, where all the times are divided by the “old” time:

Screenshot 2019-10-22 at 10.25.23.png

Indeed, here is an example of just such a graph, from a paper published in ASPLOS 2016:screen-shot-2016-08-03-at-10-39-45.png

Of course, my graph (and the ASPLOS graph too) is rather redundant because all the red bars have a height of exactly 1! Better would be to use a log scale on the y-axis, and then to draw the green/blue bars as ascending or descending from the y=1 line. A log scale is more appropriate than a linear scale for relative values anyway, as I’ve argued in a previous blog post. This leads to the following graph:

Screenshot 2019-10-22 at 10.27.17.png

It’s now straightforward to spot which benchmarks have been slowed down by our new techniques, and which have been sped up. But the general trend is still not easy to see. So we turn to another scatter plot:

Screenshot 2019-10-22 at 10.29.17.png

Here, the x-axis shows the relative time taken by the first new technique, and the y-axis shows the relative time taken by the second new technique. The diagonal line compares the first new technique against the second new technique (points above the diagonal represent benchmarks where the first new technique is faster). The vertical line at x=1 compares the first new technique against the old technique (points to the left of the line represent benchmarks where the first new technique is faster). The horizonal line at y=1 compares the second new technique against the old technique (points below the line represent benchmarks where the second new technique is faster).

Including error bars

There may be some uncertainly in the values that we wish to plot. On a bar chart, this uncertainty is typically shown using error bars. Error bars can be drawn on scatter plots too, but once we have error bars stretching horizontally and vertically from each point (to show the uncertainty along both dimensions), the picture quickly becomes unreadable. One way to show uncertainty on a scatter plot is by varying the size of each data point, so that the larger the data point, the more uncertainty we have about its position. More precisely: we can draw each data point as an ellipse whose width corresponds to the uncertainty along the x-axis, and whose height corresponds to the uncertainty along the y-axis. As an example, the first scatter plot in this post, with additional uncertainty-ellipses, might look like this:

Here, we see that the uncertainty in the measurements means that for two of the benchmarks, we’re not sure which of the two techniques wins.

In conclusion, when you are planning your next “performance comparison” figure, and especially if you are using an unordered set of benchmarks, why not consider using a scatter plot?

NB: LaTeX code for all of the figures above is available.

Screenshot 2019-10-22 at 10.09.57
wicko3
Screenshot 2019-10-22 at 09.48.45.png
Screenshot 2019-10-22 at 10.01.47.png
Screenshot 2019-10-22 at 10.16.37.png
Screenshot 2019-10-22 at 10.25.23.png
screen-shot-2016-08-03-at-10-39-45.png
Screenshot 2019-10-22 at 10.27.17.png
Screenshot 2019-10-22 at 10.29.17.png
http://johnwickerson.wordpress.com/?p=1968
Extensions
Loop invariants – where should we put them?
Uncategorized
Many verification-oriented programming languages have built-in support for attaching loop invariants to loops. A loop invariant is a condition that holds in four different places: immediately before the loop, at the start of each iteration (before evaluating the test condition), at the end of each iteration, and immediately after the loop. Usually, the invariants are… Continue reading Loop invariants – where should we put them? →
Show full content

Many verification-oriented programming languages have built-in support for attaching loop invariants to loops. A loop invariant is a condition that holds in four different places:

  • immediately before the loop,
  • at the start of each iteration (before evaluating the test condition),
  • at the end of each iteration, and
  • immediately after the loop.

Usually, the invariants are written just after the test condition. For example, in Dafny you’d write this:

while i < n
   invariant i <= n
{
   i := i + 1;
}

and in VCC you’d write this:

while (i < n)
   _(invariant i <= n)
{
   i = i + 1;
}

and in Whiley you’d write this:

while i < n
   where i <= n:
   i = i + 1

But are these languages actually putting these loop invariants in the best place?

Well, perhaps not. The thing is, the loop invariants are required to hold at the beginning of every iteration, before the test condition is evaluated. (If the test condition holds, then the loop invariants must then be reestablished by the end of that iteration.) So writing the invariants after the test condition, as all of the languages above do, is really rather misleading. For instance, in the code above, it looks like we should know that i < n holds by the time we evaluate the invariant. In fact we only know that i <= n holds.

One alternative would be to write the invariants before the test condition, like this:

invariant i <= n
while i < n
{
   i := i + 1;
}
_(invariant i <= n)
while (i < n)
{
   i = i + 1;
}
where i <= n
while i < n:
   i = i + 1

which I think is perfectly acceptable. Another alternative would be to write the invariants at the end of the loop body, like this:

while i < n
{
   i := i + 1;
} invariant i <= n;
while (i < n)
{
   i = i + 1;
}
_(invariant i <= n)
while i < n:
   i = i + 1
where i <= n

which I also find inoffensive (though I think it works less well with the Python-style indentation rules that Whiley uses).

Ultimately, I think there’s no single place that is truly ideal for writing invariants. The thing is, as I mentioned above, loop invariants hold in four different places. So unless we write the same invariant in all four places, or we start writing programs as flowcharts rather than structured text, the user will need to do some mental gymnastics to remember exactly where they hold.

But perhaps these mental gymnastics can be made a little easier by moving the invariants to a slightly better position.

Screenshot 2019-10-04 at 16.21.20
wicko3
http://johnwickerson.wordpress.com/?p=1952
Extensions
Modulo scheduling with rational initiation intervals
Computer Science
It was great to have Patrick Sittel visit our group earlier this year. This blog post is about the work he and I did during his visit, which has been accepted as a paper (co-authored with Martin Kumm and Peter Zipf) at ASP-DAC 2020. Suppose we wish to fit a row of tiles to a… Continue reading Modulo scheduling with rational initiation intervals →
Show full content

It was great to have Patrick Sittel visit our group earlier this year. This blog post is about the work he and I did during his visit, which has been accepted as a paper (co-authored with Martin Kumm and Peter Zipf) at ASP-DAC 2020.

Suppose we wish to fit a row of tiles to a bathroom wall in a rather particular way. Our endeavour can be broken down into the following five tasks, each of which must be performed on each tile.

  • Task 1. Collect a tile and fix it to the next position on the wall.
  • Task 2. Select any colour except the final (i.e. post-varnishing) colour of the previous-but-one tile.
  • Task 3. Paint the tile with the chosen colour.
  • Task 4. Apply a coat of varnish to the tile.
  • Task 5. Sprinkle the tile with glitter.

Let us assume that each task takes exactly one minute. If we hire just one worker, we would expect them to be able to process one tile every five minutes. But if we can afford to hire more workers, can we get the job done faster?

We can, but before we start them all working at once, we need to be careful about which tasks depend on each other. For instance, we cannot sprinkle the tile with glitter (Task 5) until we have varnished it (Task 4), or else the glitter might not stick to the tile. We cannot varnish the tile until it has been painted (Task 3). And we cannot paint the tile until we have collected the tile (Task 1) and selected a colour for it (Task 2). The most subtle dependency is between Task 2 and Task 4: we cannot select a colour for the current tile until we have varnished the last-but-one tile, since varnishing a tile might alter its colour.

We can depict all these dependencies in the following diagram.

Screenshot 2019-09-05 at 10.17.44

The arrow labelled “2” means that Task 2 depends on Task 4 being completed from two tiles ago. The other arrows represent dependencies between tasks on the same tile. (We can imagine that they are labelled with “0”.)

Now, suppose we have three workers available. How should we allocate tasks to them, and in which order? The table below shows one possible schedule for the workers.

Screenshot 2019-09-05 at 10.51.00

Worker A alternates between Task 1 and Task 3, Worker B alternates between Task 2 and Task 4, and Worker C is responsible only for Task 5. Note that each tile has its tasks done in an appropriate order: Task 1 and Task 2 are done simultaneously, and then Task 3, Task 4, and Task 5 follow in sequence. And crucially, note that we do not begin Task 2 until Task 4 has been completed on the previous-but-one tile. For instance, note that Task 2 on tile 5 is scheduled after Task 4 on tile 3.

On the left-hand side of the table, a little blue triangle indicates the times when a new tile is collected. We see that this happens every two minutes. So, the benefit of trebling our workforce is that we can improve our tiling rate from 0.2 tiles per minute to 0.5 tiles per minute — we are now 2.5 times faster!

We’re not thrilled by this outcome, however. From a 3x increase in workforce, we might hope for a 3x increase in productivity! The crux of the problem is obvious enough: Worker C is scheduled to do nothing half of the time. Can we do anything to improve this situation? It’s not obvious that we can, given the awkwardness of trying to split five tasks evenly between three workers.

It turns out that we can do better if we allow our tiling rate to fluctuate over time. Here is a schedule that does that.

Screenshot 2019-09-05 at 11.13.29

In this schedule, each tile is processed by just one worker – for instance, tile 1 has all of its tasks performed by Worker A. Except for the first few minutes, all three workers are kept busy the whole time (unlike in our first schedule). And unlike our first schedule, the rate at which tiles are collected is not constant – as shown by the blue triangles, the interval between tile collections is two minutes, then two minutes, then one minute, and then the pattern repeats. In other words, this schedule is able to process three tiles every five minutes, or 0.6 tiles per minute. We are now 3x faster than the 0.2 tiles per minute that we managed with just one worker, just as we would hope.

Now, our paper is not actually about fixing tiles to walls, but about scheduling operations when designing a piece of hardware. Traditional hardware design tools limit themselves to schedules where the interval between operations is a fixed integer, as in the first schedule above. In our work, we present a tool that also considers schedules where this interval fluctuates, as in the second schedule above. We call these intervals “rational” because the average among these fluctuations can be interpreted as a rational number. We show that, just like we saw in our tiling example, this additional flexibility can lead to the discovery of better schedules that keep all the hardware components busy all of the time, and ultimately, lead to faster hardware.

Screenshot 2019-09-05 at 11.13.29
wicko3
Screenshot 2019-09-05 at 10.17.44
Screenshot 2019-09-05 at 10.51.00
http://johnwickerson.wordpress.com/?p=1943
Extensions
How to draw block diagrams
AcademiaComputer ArchitectureComputer Science
A huge number of academic papers, particularly in the fields of computer systems/architecture, use some sort of block diagram to give readers an overview of the solution being presented. For instance, about two thirds of the papers presented this year at ASPLOS contained at least one of these diagrams, usually towards the start of the paper.… Continue reading How to draw block diagrams →
Show full content

A huge number of academic papers, particularly in the fields of computer systems/architecture, use some sort of block diagram to give readers an overview of the solution being presented. For instance, about two thirds of the papers presented this year at ASPLOS contained at least one of these diagrams, usually towards the start of the paper. Some of these are shown below:

These block diagrams typically consist of blocks that represent actions and arrows that are labelled with the inputs and outputs that are passed around between those actions. They can be quite eye-catching, and often appear to be rather easy to understand, which explains why they are used so widely.

In this blog post, I will explore some guidelines for drawing these diagrams so that they really are easy to understand.

First, it is worth noting that a block diagram is a form of (directed) bipartite graph. That is, it can be seen as a graph with two types of vertices: action vertices and data vertices. Each edge in the graph connects an action vertex to a data vertex (or vice versa); there are no edges from actions to actions or from data to data.

This leads us to the first way to draw a block diagram: explicitly as a bipartite graph, as in the following diagram for a very simple compiler.Screenshot 2019-08-08 at 15.48.51

Note the importance of clearly distinguishing the two different types of vertex. Some of the ASPLOS diagrams above did not do this, which makes them unnecessarily hard to understand.

The second way is to remove the action vertices, and instead to have the actions appear as labels on arrows between data vertices, as shown below. The advantage of this representation is that we no longer need two different styles of vertex.

Screenshot 2019-08-08 at 15.48.57

The third way is to remove the data vertices, and instead to have the data appear as labels on arrows between action vertices, as shown below. This style also avoids the need for two different styles of vertex, though it has the slight disadvantage that one is never quite sure whether the first and last labels should be drawn above the arrows (as I have done here) or just beyond their tips. Which style of diagram is chosen may depend on whether one wishes to emphasise the sequence of actions that a system takes, or the data that it operates on.

Screenshot 2019-08-08 at 15.49.03

The problem with these latter two representations is that they don’t scale very well to more complicated designs. For instance, suppose we start with the following diagram of a more complicated compiler.

Screenshot 2019-08-08 at 15.49.09

If we try to draw this diagram in the “data-oriented” style, it becomes tricky to place the labels of actions that consume multiple pieces of data. See, for instance, the awkward “frontend” label in the picture below.

Screenshot 2019-08-08 at 15.49.15On the other hand, if we try to draw this diagram in the “action-oriented” style, it becomes tricky to deal with pieces of data that are consumed or produced by multiple actions, such as the “IR” data in the picture below. Screenshot 2019-08-08 at 15.49.21

To conclude: action-oriented and data-oriented block diagrams are fine in simple cases, and can lead to simpler diagrams than those that make the bipartite structure explicit. But when there are bifurcations in the control flow or data flow of a system, it may be better to resort to the explicit bipartite representation.

 

Screenshot 2019-08-08 at 15.49.09
wicko3
http://johnwickerson.wordpress.com/?p=1907
Extensions
A wordless proof
Mathematics
Show full content

parallelogram

parallelogram
wicko3
http://johnwickerson.wordpress.com/?p=1884
Extensions
A rant about fire doors
Miscellaneous
UK housebuilding specifications require all doors that lead from a habitable room to the outside to meet certain “fire door” criteria. These include having a self-closing mechanism, since fire doors are only effective when they are closed. Housebuilders typically implement this specification using a simple and discreet spring that pulls the door closed. This is… Continue reading A rant about fire doors →
Show full content

UK housebuilding specifications require all doors that lead from a habitable room to the outside to meet certain “fire door” criteria. These include having a self-closing mechanism, since fire doors are only effective when they are closed.

Housebuilders typically implement this specification using a simple and discreet spring that pulls the door closed.

This is stupid.

The first thing any occupier of the house is going to do is to get a wedge to keep the door open. This is partly because it’s a pain having to open doors all the time, and partly because spring-loaded doors are quite hazardous to any pets and small children living in the house. And as soon as the door is wedged open, it becomes completely useless as a fire door! At least if it didn’t have a self-closing mechanism, then the fire door might at least sometimes be closed. The addition of the self-closing mechanism only serves to guarantee that the door will never be closed!

In my opinion, the specification needs amending in two ways. First, the self-closing mechanism must be activated when, and only when, the fire alarm goes off. This can be implemented using a door catch that is powered by an electromagnet that is connected to the fire alarm circuit. Second, the door closer must close the door in a controlled way, without slamming, so as not to be hazardous to pets and small children.

If the self-closing mechanism does not conform to both of these requirements then any occupier of the house is simply going to use a door wedge.

PERKO-POWER_controlled_concealed_jamb-mounted_door_closer
wicko3
http://johnwickerson.wordpress.com/?p=1881
Extensions
Never Say Never Again – chords
Music
The music for this little-known James Bond song doesn’t appear in any of the official songbooks, or indeed anywhere else on the internet as far as I can tell. So here’s my attempt to rectify that. Never Say Never Again (1983) Music: Michel Legrand Lyrics: Lani Hall ----------------------------- Intro: Cm, F, etc. Cm F You… Continue reading Never Say Never Again – chords →
Show full content

The music for this little-known James Bond song doesn’t appear in any of the official songbooks, or indeed anywhere else on the internet as far as I can tell. So here’s my attempt to rectify that.

Never Say Never Again (1983)
Music: Michel Legrand
Lyrics: Lani Hall
-----------------------------

Intro: Cm, F, etc.

Cm            F
You walk in a room
Cm          F
A woman can feel the heat
Cm            F
One look is a guarantee
Cm              F
Nights could be long and sweet
Cm             F
The message is clear
Cm                F
Like nothing I've ever known
Ebm                 Ab
But the more that I hear
Ebm          Ab
Forget about long-range plans
Ebm         Ab    Ebm     F#
'Cause this man's got his own
   G   Ab    F#m       B            E
To get mixed up with a man who says never
Em         A            Dmaj7
May be big trouble, but then
       Ebm          Ab       Dbmaj7
I just could be the woman to take you
Ddim                           Cm
And make you never say never again
                        F
Never, never say never again
                        Cm
Never, never say never again
                        F
Never, never say never again
                        Cm
Never, never say never again
                   F
You've got all the moves
Cm                F
Ah, but baby I've got them too
Cm             F
No matter your attitude
Cm            F
Or your mood, I'll come through
C#m               F#
The touch of your voice
C#m              F#
The feel of your eyes on me
Em              A
You leave me no choice
Em                    A
Though I know there's danger there
Em            A      G
I don't care, let it be

   Ab  A  Gm         C            F
To get it bad with a man who says never
Fm          Bb          Ebmaj7
May have no future, but then
       Em           A        Dmaj7
I just could be the woman to reach you
Ebdim                             C#m
And teach you to never say never again

I'll beg you
     F#
I'll get you
     C#m
I'll reach you
     F#
I'll teach you
     C#m
I'll take you
     F#
I'll make you
nsnacreamsuit
wicko3
http://johnwickerson.wordpress.com/?p=1872
Extensions
Greatest hits of PLDI 2018
AcademiaComputer ArchitectureComputer ScienceFPGAsProgramming LanguagesUncategorized
I had a great time at PLDI 2018 last week. Here is my take on a few of the papers that stood out for me. John Vilk presented a tool called BLeak for finding memory leaks in web browsers. One might think that leak detection is not important in a garbage-collected setting, but Vilk explained… Continue reading Greatest hits of PLDI 2018 →
Show full content

I had a great time at PLDI 2018 last week. Here is my take on a few of the papers that stood out for me.

  • John Vilk presented a tool called BLeak for finding memory leaks in web browsers. One might think that leak detection is not important in a garbage-collected setting, but Vilk explained that actually it is. There may be situations where a data structure is growing out of control, but it cannot be garbage-collected because there is still a reference to it somewhere. Vilk’s tool is built on the observation that if a browsing session repeatedly returns to the same visual state, and if every time the browser returns to that state, its total memory usage has increased, then there is probably a memory leak. Vilk used his tool in this way to identify several memory leaks in popular websites, including Airbnb. A limitation in the utility of the tool at the moment is that the user must give the tool a browsing session that repeatedly returns to the same visual state — it would be nice if the tool could discover these sessions by itself, for then it could truly be set loose on the worldwide web!
  • Sam Baxter presented a tool called Stopify that automatically instruments Javascript code with periodic “yield” instructions that return control to the browser. The downside of doing so is a reduction in performance; the upside is an improvement in responsiveness — no longer will a long-running Javascript function hang your browser. Baxter proposed that his tool can be usefully employed by the various web services that allow programmers to write and test Javascript programs online (or indeed any language that can compile to Javascript), to ensure that executing badly-written code does not lead to the browser tab having to be killed. In the Q&A session, Baxter clarified that there is indeed an interesting trade-off to be made between inserting enough “yield” instructions to ensure responsiveness, and inserting so many that performance is ruined.
  • Ivan Radiček and Ke Wang both presented papers that tackle the problem of how to automatically generate feedback for students who have made an incorrect attempt at a programming exercise. This is an increasingly important problem now that we have online courses in which a single instructor aims to teach thousands of students to program. Both papers hinge on the insight that in a large class, many of the solutions will be correct. So, given an incorrect solution, the task is to find the syntactically-closest of the correct solutions, and generate a “diff” that explains to the student how their solution could be fixed to make it correct. The task is complicated by the fact that each solution might use different names for the same variable, so any comparison between programs must be done up to alpha-equivalence. These rather atypical papers generated quite a few questions from the audience, including whether the approach could be extended to give feedback not just about correctness issues, but also about issues related to good programming style. I wonder whether their technique could be adapted to become another useful tool in the teacher’s toolbelt: a plagiarism detector.
  • Stephen Dolan et al.‘s paper tackled the question of what guarantees programming languages should make about what will happen when a program contains a data race (i.e. two accesses to the same memory location, at least one of them a write, without proper synchronisation in between). Current languages such as C/C++ make no guarantees for racy programs. This is good for compiler-writers because it means that they only have to make their optimisations work correctly on race-free programs, but it’s bad for programmers wanting to reason about their code. The problem is that any data race — even in some library code that was written long ago — prevents any reasoning about any part of the rest of the program, and Dolan et al. argue that this is unsatisfactory. Their new semantics ensures that if some part of the program is free from races, then at least that part can be reasoned about. This is a pretty radical departure from how current languages work — perhaps too radical for the proposal to be adopted any time soon — but it seems eminently sensible, even if it comes at the cost of making the compiler-writers’ lives a tiny bit harder.
  • Thomas Wahl presented a paper with Peizun Liu about the static analysis of concurrent programs. Their paper introduced me to an interesting result established in 2001 by Ramalingam: that static analysis in a language that features both recursive procedures and inter-thread synchronisation is undecidable — even in the absence of any program variables. Ramalingam showed that the problem of deciding whether a given program point is reachable or not can be reduced to Emil Post’s “correspondence” problem, which is a classic undecidable problem. Liu and Wahl’s paper proceeds to develop a static analysis for such programs all the same, rightly undeterred by the fact that their analysis will never be able to handle all such programs.
  • Matthew Milano presented a language, called MixT, for programming distributed databases. The key idea is to allow each variable to be assigned a different consistency mode — whether that is a strong mode like “serializable” (in which database updates are immediately propagated to all replicas), a weak mode like “eventually consistent” (in which updates will eventually reach all replicas but possibly after a noticeable delay), or somewhere in between like “causally consistent”. Traditionally, consistency modes are associated with the whole database, and this makes it rather fiddly to program a system where some data needs strong consistency and other data only needs weak consistency. Milano et al.’s proposal reminded me a lot of the C++ memory model, in which each atomic operation can be associated with a different consistency mode (anywhere between “sequentially consistent” and “relaxed”). I wondered whether their work could be adapted to the C++ context, where it might ultimately lead to a more usable memory model for C++.
  • Magnus Själander presented SWOOP, an approach that combines compiler optimisation and architectural support to hide the latency of cache misses. The traditional way to hide memory latency is to employ out-of-order execution, but this complicates the design of processors, and hence makes them more power-hungry. SWOOP’s approach, which applies to code in a loop, works as follows. Execution proceeds as normal until a cache miss occurs. Via architectural support, this triggers the program to jump into an alternative execution mode, in which the first parts of the subsequent loop iterations begin executing. A compilation step has already identified which part of each loop iteration is independent of previous iterations, so executing ahead like this does not actually involve speculation — in other words, that code was going to execute anyway. Finally, the access to main memory is completed, and the remaining parts of each partially-executed loop iteration are executed.
  • David Koeplinger presented Spatial, a new language (and associated compiler) for designing FPGA-based accelerators. The motivation is that hardware languages like Verilog are too low-level for productive coding, and that high-level languages like C or OpenCL do not offer the right abstractions (and hence require clumsy pragmas all over the place). Spatial is essentially a high-level language, in that it supports structured programming (nested loops, etc.) and does not handle such low-level details as clock ticks. However, unlike C/OpenCL it gives programmers explicit control over where each variable should live (register, on-chip RAM, off-chip RAM, etc.). Moreover, it has built-in syntax for design space exploration, which means that the compiler will automatically explore a range of values for certain constants in the program, to discover which combination leads to the best-performing hardware. Spatial is clearly a promising language, as evidenced by its impressive speedups over tools like SDAccel. Myself, I would be interested to understand whether the various features in the Spatial language can be teased apart, and their impact on performance measured separately.
  • Kostas Ferles presented a tool called Expresso, which takes multi-threaded code that uses implicit signalling between threads, and turns it into equivalent code that uses explicit signalling. To expand on this: implicit signalling is where code is written using structures like critical regions, which simply mark “this piece of code must be executed in isolation”. The runtime is responsible for waking up other threads after a critical region has finished. With explicit signalling, the programmer is responsible for writing the code to wake up other threads. Explicit signalling can be more performant than implicit signalling because the programmer can invoke domain-specific knowledge to decide whether it’s better to wake up just a specific thread, or all threads, and can time the wake-up at just the right moment. However, explicit signalling is error prone. Thus, Expresso aims to offer the best of both worlds: the programming ease of implicit signalling and the performance of explicit signalling. I wonder whether Expresso could potentially be adapted from being a tool that can generate new explicit-signalling code, into being a tool that can verify the correctness of existing explicit-signalling code.
IMG_4590
wicko3
http://johnwickerson.wordpress.com/?p=1863
Extensions