GeistHaus
log in · sign up

Gowers's Weblog

Part of wordpress.com

Mathematics related discussions

stories
A recent experience with ChatGPT 5.5 Pro
ComputingStraight mathsaimathematics
We are all having to keep revising upwards our assessments of the mathematical capabilities of large language models. I have just made a fairly large revision as a result of ChatGPT 5.5 Pro, to which I am fortunate to have been given access, producing a piece of PhD-level research in an hour or so, with […]
Show full content

We are all having to keep revising upwards our assessments of the mathematical capabilities of large language models. I have just made a fairly large revision as a result of ChatGPT 5.5 Pro, to which I am fortunate to have been given access, producing a piece of PhD-level research in an hour or so, with no serious mathematical input from me.

The background is that, as has been widely reported, LLMs are now capable of solving research-level problems, and have managed to solve several of the Erdős problems listed on Thomas Bloom’s wonderful website. Initially it was possible to laugh this off: many of the “solutions” consisted in the LLM noticing that the problem had an answer sitting there in the literature already, or could be very easily deduced from known results. But little by little the laughter has become quieter. The message I am getting from what other mathematicians more involved in this enterprise have been saying is that LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it. Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments, so it is still just about possible to comfort oneself that LLMs are merely putting together existing knowledge rather than having truly original ideas. How much of a comfort that is I will not discuss here, other than to note that quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques.

I decided to try something a little bit different. At least in combinatorics, there are quite a lot of papers that investigate some relatively new combinatorial parameter that leads naturally to several questions. Because of the sheer number of questions one can ask, the authors of such papers will not necessarily have the time to spend a week or two thinking about each one, so there is a decent probability that at least some of them will not be all that hard. This makes such papers very valuable as sources of problems for mathematicians who are doing research for the first time and who will be hugely encouraged by solving a problem that was officially open. Or rather, it used to make them valuable in that way, but it looks as though the bar has just been raised. It is no longer enough that somebody asks a problem: it needs to be hard enough for an LLM not to be able to solve it.

In any case, a little over a week ago I decided to see how ChatGPT 5.5 Pro would fare with a selection of problems asked by Mel Nathanson in a paper entitled Diversity, Equity and Inclusion for Problems in Additive Number Theory. Nathanson has a remarkable record of being interested in problems and theorems that have later become extremely fashionable, which has led him to write a series of extremely well timed and therefore highly influential textbooks. In this paper, he argues for the interest of several other problems, some of which I will now briefly describe.

If A is a set of integers, then its sumset A+A is defined to be \{a+b:a,b\in A\}. For a positive integer h, the hfold sumset, denoted hA, is defined to be \{a_1+\dots+a_h: a_1,\dots,a_h\in A\}. Nathanson is interested in the possible sizes of hA given the size of A. To that end one can define a set \mathcal R(h,k) to be the set of all t such that there exists a set A with |A|=k and |hA|=t.

An obvious first question to ask is simply “What is \mathcal R(h,k)?” When h=2, the answer is the set of all integers between 2k-1 and \binom{k+1}2. It is an easy exercise to show that if |A|=k, then 2k-1\leq|A+A|\leq\binom{k+1}2, so this result is saying that all sizes in between can be realized. However, it is not true in general that hA can take every size between its minimum and maximum possibilities, and we do not currently have a complete description of \mathcal R(h,k).

Another natural question one can ask, and this is where ChatGPT came in, is how large a diameter you need if you want a set A with A and hA having prescribed sizes. (Of course, the size of hA must belong to \mathcal R(h,k).) Nathanson showed that for every t\in[2k-1,\binom{k+1}2] there is a subset A of \{0,1,2,\dots,2^k-1\} with |A|=k and |A+A|=t, and asked whether the bound 2^k-1 could be improved. ChatGPT 5.5 Pro thought for 17 minutes and 5 seconds before providing a construction that yielded a quadratic upper bound, which is clearly best possible. It wrote up its argument in a slightly rambling LLM-ish style, so I asked if it could write the argument up as a LaTeX file in the style of a typical mathematical preprint. After two minutes and 23 seconds it gave me that, after which I spent some time convincing myself that the argument was correct.

The basic idea behind both Nathanson’s argument and ChatGPT’s was that in order to obtain a set of a given size with a sumset of a given size, it is useful to build it out of a Sidon set, which means a set with sumset of maximal size (that is not quite the usual definition but it is the simplest to use in this discussion), and an arithmetic progression. Also, for a bit of fine tuning one can take an additional point near the arithmetic progression. Then if one plays around with the various parameters, one finds that one can obtain sets of all the sizes one wants. Nathanson doesn’t express his argument this way (it is Theorem 5 of this paper), instead giving an inductive argument, but I think, without having checked too carefully, that if one unravels his argument, one finds that effectively that is what he ends up with, and the Sidon set in question consists of powers of 2. ChatGPT obtained its improvement by simply using a more efficient Sidon set — it is well known that one can find Sidon sets of quadratic diameter. (One might ask why Nathanson didn’t do that in the first place: I think it is because the obvious idea of using a more efficient Sidon set becomes obvious only after one has redescribed his inductive construction. Is that what ChatGPT did? It is very hard to say.)

Next, I asked ChatGPT to see whether it could do the same for a closely related question, where instead of looking at the size of the sumset, one looks at the size of the restricted sumset, which is defined to be \{a+b:a,b\in A, a\ne b\}. Unsurprisingly, it was able to do that with no trouble at all. I got it to write both results up in a single note, to avoid a certain amount of duplication. If you are curious, you can see the note here.

I then asked what it could do for general h. I was much less optimistic that it would manage to do anything interesting, because the proof for h=2 makes fundamental use of the fact (due to Erdős and Szemerédi) that we know exactly which sizes we need to create. If we don’t know what the set \mathcal R(h,k) is, then it seems that we are forced to start with a hypothetical set A with |A|=k and |hA|=t and build out of it a set of small diameter with the same property. As it happens, I still don’t know how to get round that difficulty (I’m mentioning that just to demonstrate that my mathematical input was zero, and I didn’t even do anything clever with the prompts), but Nathanson mentioned in his paper a remarkable paper of Isaac Rajagopal, a student at MIT, who must have got round the difficulty somehow, because he had managed to prove an exponential dependence of \mathcal R(h,k) on k for each fixed h.

I’ll leave the previous paragraph there, but Isaac has subsequently explained to me that that isn’t really the difficulty. His argument gives a complete description of \mathcal R(h,k) when k is sufficiently large, and if one wants to prove a polynomial dependence for fixed h, then assuming that k is sufficiently large is clearly permitted. The real difficulty is that constructing the sets with given sumset sizes was significantly more complicated, and necessarily so because the degree of the polynomial grows with h, and one therefore needs more and more parameters to define the sets.

In any case, the task faced by ChatGPT was not to solve the problem from scratch, but to see whether it was possible to tighten up Isaac Rajagopal’s argument. Here’s what happened.

  1. After 16 minutes and 41 seconds, it came back with an argument that claimed to have improved the upper bound from exponential in k to exponential in k^\alpha for any \alpha>1/2.
  2. I asked it to write that in preprint form too, which took it a further 47 minutes and 39 seconds.
  3. That preprint would have been hard for me to read, as that would have meant carefully reading Rajagopal’s paper first, but I sent it to Nathanson, who forwarded it to Rajagopal, who said he thought it looked correct.
  4. Both ChatGPT and Rajagopal speculated a little on what might need to be done to push things further and get a polynomial bound, so I got greedy and asked ChatGPT to give that a go.
  5. After 13 minutes and 33 seconds it told me it felt optimistic about the existence of such an argument but there were a couple of technical statements that needed checking.
  6. I asked it to check them.
  7. After 9 minutes and 12 seconds it got back to me with the check having been done, so I asked for this too to be written in preprint form.
  8. After 31 minutes and 40 seconds the “preprint” was ready. Here it is.
  9. Isaac Rajagopal looked at it and declared it to be almost certainly correct. It was clear that he meant this not just at a line-by-line level but at the level of ideas.

Isaac made some very interesting remarks about the nature of what the additional ideas were that ChatGPT contributed. Since, as I have already said, my mathematical input was zero, I invited him to write a guest section to this post. Just before we get to that, I want to raise a question (that will undoubtedly have been raised by others as well), which is simple: what should we do with this kind of content? Had the result been produced by a human mathematician, it would definitely have been publishable, so I think it would be wrong to describe it as AI slop. On the other hand, it seems pointless even to think about putting it in a journal, since it can be made freely available, and nobody needs “credit” for it (except that Isaac deserves plenty of credit for creating the framework on which ChatGPT could build). I understand that arXiv has a policy against accepting AI-written content, which makes good sense to me. So maybe there should be a different repository where AI-produced results can live. But various decisions would need to be made about how it was organized. I myself think that one would probably want to have some kind of moderation process, so that results would be included only if a human mathematician was prepared to certify that they were correct — or, better still, that they had been formalized by a proof assistant — and perhaps also that they answered a question that had been asked in a human-written paper. On the other hand, I wouldn’t want a moderation process that created vast amounts of work (unless the work was itself done by AI, but there are obvious dangers in going down that route). Anyway, until these questions are answered, this result is available from the link above, and perhaps, now that LLMs are so good at literature search, that will be enough to make it findable by anyone who wants to know whether Nathanson’s problem has been solved.

Isaac’s evaluation of what ChatGPT achieved

With just a few prompts, ChatGPT was able to improve the upper bound on N(h,k) (which I will define very soon) from exponential in k to polynomial in k. While its first improvement of the bound, from exponential in k to exponential in k^{\frac{1}{2} + \varepsilon}, was a routine modification of my work, the improvement to polynomial in k is quite impressive. To do this, ChatGPT came up with an idea which is original and clever. It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove, using similar methods to those in my own proof. My goal is to explain that idea, in a manner that will be digestible to my friends who are computer science majors as well as my math major friends.

The problem of bounding N(h,k) is closely related to a problem I worked on at the Duluth REU (Research Experience for Undergrads) program, of determining \mathcal{R}(h,k). In particular, \mathcal{R}(h,k) is the set of possible h-fold sumset sizes |hA|, where A can be chosen to be any set of k integers. N(h,k) is the minimal N such that we can achieve all of the values of \mathcal{R}(h,k) using k-element sets A \subset \{0,1,2,\ldots,N\}. I spent last summer explicitly characterizing the set \mathcal{R}(h,k) for large k, by constructing sets A such that |hA| achieves all sizes which I could not rule out as impossible. So, N(h,k) can be upper-bounded by optimizing my constructions.

I constructed these sets A by combining smaller component sets which are simpler to analyze. Some of these components are the geometric series

\displaystyle S = \{0,1,m,m^2,\ldots,m^{\ell-2}\} \quad \hbox{and} \quad T = \{1,m,m^2,\ldots,m^{\ell-1}\} \qquad (1)

for various values of 2 \leq m \leq h and 2 \leq \ell \leq k. Unfortunately, the elements of S and T are exponentially large in terms of k. So, I asked ChatGPT (through Tim) whether there exist sets of \ell elements which have similar sumset sizes to these geometric series, but contain only numbers of polynomial size in \ell: I had no idea if this was possible, or how to begin constructing such sets. ChatGPT came back with an answer, constructing sets G and H which behave like “half a geometric series squeezed into a polynomial interval,” which is counterintuitive. Before I discuss the construction of G and H, I will explain the important properties of the sumset sizes of S and T which they recreate.

For h > 0, a set A is called a B_h set if the only solutions to

\displaystyle x_1+\cdots+x_h = y_1+\cdots+y_h

with x_i,y_i in A are the “trivial” solutions, by which I mean that one side of the equation is a reordering of the other side. If A is a B_h set of size \ell, then elements of hA correspond exactly to choices of h elements of A, with repetition allowed. Using “stars and bars,” one can see that |hA| = \binom{h+\ell - 1}{h} and this is the maximum possible value of |hA| among sets of size \ell. So, another definition is that A is a B_h set if |hA| = \binom{h+|A| - 1}{h}. Sidon sets, which Tim discussed, are exactly B_2 sets.

To make things more concrete, let us assume that m = 4 in (1). Then, S is a B_3 set, but it is not a B_4 set because of the relations

\displaystyle 4^{a} + 4^a + 4^a + 4^a = 4^{a+1} + 0 + 0 + 0 \qquad (2)

for any choice of a in \{0,1,2,\ldots, \ell-3\}. In particular, \binom{\ell+3}{4} - |4S| = \ell-2, as these \ell-2 relations are the only ones preventing S from being a B_4 set. T lacks the relations in (2) because 0 is not in T. So, T is a B_4 set, but it is not a B_5 set because of the relations

\displaystyle 4^{a} + 4^a + 4^a + 4^a + 4^{b+1} = 4^{a+1} + 4^b + 4^b + 4^b + 4^b \qquad (3)

for any choices of a \neq b in \{0,1,2,\ldots, \ell-2\}. This gives \binom{\ell-1}{2} relations, and one can check that \binom{\ell+4}{5} - |5T| = \binom{\ell-1}{2}. To summarize, we have seen that

(a) S is a B_{m-1} set.

(b) \binom{m+\ell-1}{m} - |mS| = \ell -2 is a linear function of \ell.

(c) T is a B_{m} set.

(d) \binom{m+\ell}{m+1} - |(m+1)T| = \binom{\ell-1}{2} is a quadratic function of \ell.

    ChatGPT was able to find sets G and H of \ell elements which satisfy (a)-(d), but whose elements all have polynomial size in \ell. The construction of G and H uses h^2-dissociated sets, which are sets A where the only solutions to

    \displaystyle x_1+\cdots+x_s = y_1+\cdots+y_{s'} \qquad (4)

    with s,s' \leq h^2 and x_i,y_i in A are the “trivial” solutions, i.e. s = s' and one side of the equation is a reordering of the other side. For r > 0, it is possible to construct an h^2-dissociated set U = \{u_1,\ldots,u_r\} \subseteq \{0,1,2,\ldots,N\}, where N is approximately r^{h^2}, and in particular polynomial in r. Constructions of such a U using finite fields date back to Singer (1938) and Bose–Chowla (1963) and are described in Appendix 1. Define

    \displaystyle G = \{0, u_1,u_2,\ldots,u_r,mu_1,mu_2,\ldots, mu_r\}

    and

    \displaystyle H= \{u_1,u_2,\ldots,u_r,mu_1,mu_2,\ldots, mu_r\}. \qquad (5)

    In hindsight, I have good intuition for the construction of G and H. All of the relations in (2) and (3) are formed by combining one or two relations of the form 4x = y. There are approximately \ell relations of the form mx = y in S and T, and approximately \ell/2 such relations in G and H. There are few other low-order relations in S and T, and similarly in G and H because U is h^2-dissociated. So, G and H manage to contain half as many mx = y-relations as their geometric series counterparts, while also containing few low-order relations.

    We now see why (a)-(d) hold with S and T replaced by G and H, respectively. For concreteness, we assume that m = 4 and h>4, so U contains no nontrivial relations as in (4) with s,s' \leq 25 \leq h^2. Then, G is a B_3 set, but it is not a B_4 set because of the relations

    \displaystyle u_i + u_i + u_i + u_i = 4u_i + 0 + 0 + 0

    for any choice of i in \{1,2,\ldots, r\}. If we let \ell = |G| = 2r+1, we can check that \binom{\ell + 3}{4} - |4G| = r = \frac{\ell-1}{2} is linear in \ell. In particular, (a) and (b) hold with S replaced by G, and the linear function \ell-2 replaced by \frac{\ell-1}{2}. We can also see that H is a B_4 set, but it is not a B_5 set because of the relations

    \displaystyle u_i + u_i + u_i + u_i + 4u_j = 4u_i + u_j + u_j +u_j + u_j

    for any i\neq j in \{1,2,\ldots, r\}. If we let \ell = |H| = 2r, we can check that \binom{\ell + 4}{5} - |5H| = \binom{r}{2} = \binom{\ell/2}{2} is quadratic in \ell. In a similar manner, (c) and (d) hold with T replaced by H, and the quadratic function \binom{\ell-1}{2} replaced by \binom{\ell/2}{2}.

    Even though I can motivate it in retrospect, ChatGPT’s idea to use h^2-dissociated sets to control relations of order at most h feels quite ingenious. As far as I can tell, this idea is completely original.

    ChatGPT’s proof that its construction produces the desired values of |hA| is very similar to my proof that the sets A which I construct achieve all possible values of |hA|, after replacing S and T by G and H, respectively. Properties (a)-(d) capture many of the important properties of S and T (or G and H) which are used in this proof. The final constructions involve combining the sets G and H (or S and T in my paper) for each value of m between 2 and h with another set which is the union of an arithmetic progression and a point. Intuitively, G and H (or S and T) have large sumsets, while arithmetic progressions have small sumsets, so it is plausible that one could get sets which achieve all the medium-sized sumsets by combining them. However, the proof of this is quite involved, and it occupies Section 4 of my paper and the entirety of the ChatGPT preprint. In Appendix 2, I work out the details of the ChatGPT construction to show that for k sufficiently large,

    \displaystyle N(h,k) \leq O\left(k^{10h^3}\right).

    For comparison, it is easy to see that N(h,k) is at least on the order of k^{h}, and it is unknown what the real value is. In Appendix 3, I give details of the correspondence between my paper and the ChatGPT preprint, which will be helpful for those who want to read either.

    Finally, I want to express my deep gratitude to Tim for allowing me to contribute to this blog. I am still stunned by the coincidence that the problem he chose to put into ChatGPT 5.5 Pro led him to my paper on the arXiv.

    Tim on what this means for mathematical research

    I would judge the level of the result that ChatGPT found in under two hours to be that of a perfectly reasonable chapter in a combinatorics PhD. It wouldn’t be considered an amazing result, since it leant very heavily on Isaac’s ideas, but it was definitely a non-trivial extension of those ideas, and for a PhD student to find that extension it would be necessary to invest quite a bit of time digesting Isaac’s paper, looking for places where it might not be optimal, familiarizing oneself with various algebraic techniques that he used, and so on.

    It seems to me that training beginning PhD students to do research, which has always been hard (unless one is lucky enough, as I have often been, to have a student who just seems to get it and therefore doesn’t need in any sense to be trained), has just got harder, since one obvious way to help somebody get started is to give them a problem that looks as though it might be a relatively gentle one. If LLMs are at the point where they can solve “gentle problems”, then that is no longer an option. The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.

    I would qualify that statement in two ways though. First, there is the obvious point that a beginning PhD student has the option of using LLMs. So the task is potentially easier than proving something that LLMs can’t prove: it is proving something in collaboration with LLMs that LLMs cannot manage on their own. I have done quite a lot of such collaboration recently and found that LLMs have made useful contributions without (yet) having game-changing ideas.

    A second point is that I don’t know how much of what I have said generalizes to other areas of mathematics. Combinatorics tends to be quite focused on problems: you start with a question and you reason back from the question or if you reason forwards you do so very much with the question in mind. In other areas there can be much more of an emphasis on forwards reasoning: you start with a circle of ideas and see where it leads. To do it successfully, you need to have some way of discriminating between interesting observations and uninteresting ones, and it isn’t obvious to me what LLMs would be like at that.

    Of course, everything I am saying concerns LLMs as they are right now. But they are developing so fast that it seems almost certain that my comments will go out of date in a matter of months. It is also almost certain that these developments will have a profoundly disruptive effect on how we go about mathematical research, and especially on how we introduce newcomers to it. Somebody starting a PhD next academic year will be finishing it in 2029 at the earliest, and my guess is that by then what it means to undertake research in mathematics will have changed out of all recognition.

    I sometimes get emails from people who are interested in doing mathematical research but are not sure whether that makes sense any more as an aspiration. I have a view on that question, but it may very well change in response to further developments. That view is that there is still a great deal of value in struggling with a mathematics problem, but that the era where you could enjoy the thrill of having your name forever associated with a particular theorem or definition may well be close to its end. So if your aim in doing mathematics is to achieve some kind of immortality, so to speak, then you should understand that that won’t necessarily be possible for much longer — not just for you, but for anybody. Here’s a thought experiment: suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.

    So what is the point of struggling with a difficult mathematics problem? One answer is that it can be very satisfying to solve a problem even if the answer is already known, but I don’t think that is a sufficient reason to spend several years of your life on this peculiar activity. A better answer is that by solving hard problems you get an insight into the problem-solving process itself, at least in your area of expertise, in a way that you simply don’t if all you do is read other people’s solutions. One consequence of this is that people who have themselves solved difficult problems are likely to be significantly better at using solving problems with the help of AI, just as very good coders are better at vibe coding than not such good coders, or people who have a solid grasp of how to do basic arithmetic are likely to be more skilled at using calculators (and especially at noticing when an answer feels off). Mathematics is a highly transferable skill, and that applies to research-level mathematics as well. By doing research in mathematics, you may not get the same rewards as your equivalents a generation ago, but there is a good chance that you will be equipping yourself very well for the world we are about to experience.

    Appendix 1 (Isaac)

    We will construct an h-dissociated set U = \{u_1,\ldots,u_r\} \subseteq \{0,1,2,\ldots,N\}, where N is approximately r^{h}. This construction is a very minor modification of Bose–Chowla (1963)’s construction of a B_h set, which I learned about from this paper. For whatever reason, the GPT preprint (Lemma 3.1) uses a different, less efficient construction using moment curves.

    Let p > r be a prime, let N = p^{h+1}-2, let K be the finite field with p^{h+1} elements and fix a generator \theta of K^\times, so that K^\times is equal to \{\theta^0,\theta^1,\ldots, \theta^N\}. Define a set of p elements

    \displaystyle U = \{a \in \{0,1,2,\ldots,N\}: \theta^a - \theta \in \mathbb{F}_p\}.

    Then, each element a \in U corresponds to a unique value of \tilde{a} \in \mathbb{F}_p, by taking \tilde{a} = \theta^a - \theta. Now an additive relation of the form in (4) with s,s' \leq h can be reframed by taking powers of \theta as

    \displaystyle (\theta + \tilde{x_1})(\theta + \tilde{x_2})\cdots (\theta + \tilde{x_s}) = (\theta + \tilde{y_1})(\theta + \tilde{y_2})\cdots (\theta + \tilde{y_{s'}}). \qquad (6)

    As K is a degree-h+1 extension of \mathbb{F}\sb{p} and \theta is a generator of K as an \mathbb{F}\sb{p}-extension, this means that \theta does not satisfy any nonzero polynomials in \mathbb{F}\sb{p}[x] of degree \leq h. So, both sides of (6) are identical as polynomials in \mathbb{F}_{p}[\theta] and thus the additive relation in (4) is trivial. So, U is h-dissociated, and of course one can prune a few elements to reduce U to size r.

    Appendix 2 (Isaac)

    Fix constants \alpha,\beta,\gamma such that 0.5 < \beta\gamma < \beta < \alpha < 1 (in my paper I arbitrarily chose (\alpha,\beta,\gamma) = (0.9,0.8,0.7)). Let the two sets in (5) be called G_{m,r} and H_{m,r}. Let [a,b] denote the set of integers x satisfying a \leq x \leq b. Similarly to my paper, the constructions of A such that hA achieves the desired sizes will combine sets of the following four types:

    • B_{j,b} := [0,b-2] \cup \{b-2+j\} with choices of b \in [3, k-k^\gamma] and j \in [1,hb].
    • G_{m,r_m} for each value of m \in [3, h], with choices of r_m \in [0, (k-b)^\alpha].
    • H_{m,u_m} for each value of m \in [2,h-1], with choices of u_m \in [0, (k-b)^\beta].
    • A B_h set of the correct size so that |A| = k.

    One reason that this construction needs to be complicated is that we need to create at least \Omega(k^h) many sets. To do this, we vary 2h-4 parameters r_m and u_m in the domain [0,k^\alpha] and 2 parameters b and j in the domain [1,hk]. We can choose \alpha to be slightly bigger than 1/2, and then the above construction gives us O(k^{\alpha(2h-4)+ 2})=O(k^{h + \delta}) different sets where \delta >0 can be made arbitrarily small. So, if we were to remove any of the above parameters from the construction, and not change the others, this construction would no longer create \Omega(k^h) many sets. In comparison, Nathanson’s construction when h=2 only needs to create \Omega(k^2) sets. He does this by combining a Sidon set, an arithmetic progression, and one extra value, and varying the size of the arithmetic progression and the extra value in ranges of size O(k).

    We want to combine q = 2h-2 sets A_1,\ldots,A_q, which are given by B_{j,b}, G_{m,r_m} for the h-2 values of m \in [3,h], H_{m,u_m} for the h-2 values of m \in [2,h-1], and a B_h set. By Appendix 1, for all r \leq k, there exists a h^2-dissociated set {u_{1},\ldots,u_{r}} of diameter M \leq r^{2h^2} \leq k^{2h^2}. By the constructions of G_{m,r_m} and H_{m,u_m}, we can take each A_i \subseteq [0,M], where M \leq hk^{2h^2}. Let \mathbb{Z}^{2q} have basis vectors e_1,\ldots,e_{2q}. To combine A_1,\ldots,A_q, we can define A \subseteq \mathbb{Z}^{2q} as

    \displaystyle A = \bigcup_{i=1}^q (A_i e_i + e_{q+i}) \subseteq \{0,1,2,\ldots,M\}^{2q} \subseteq \mathbb{Z}^{2q}.

    Similarly to my Lemma 4.9, this construction ensures that the generating function product \mathcal{F}_{A}(z) = \prod_{i=1}^q \mathcal{F}_{A_i}(z) holds, which is the identity that both my paper and the GPT preprint use (see either paper for a definition of these generating functions). By (the standard) Lemma 2.3 of the GPT preprint, A is Freiman-isomorphic of order h to a subset of [0,2qM(2hM)^{2q-1}]. Therefore, for k sufficiently large (the whole construction relies on this for the same reasons as in my paper),

    \displaystyle N(h,k) \leq 2qM(2hM)^{2q-1} \leq 2\left(2h^2k^{2h^2}\right)^{2(2h-2)} \leq k^{10 h^3}.

    Appendix 3 (Isaac)

    In Section 4.2 of my paper, I use a different, simpler construction to construct sets A achieving the values in \mathcal{R}(h,k) which have |hA| < \varepsilon k^h, for some small \varepsilon. These sets A are subsets of {0,1,2,\ldots,k^h}, meaning that all elements have polynomial size in k. This is observed in Section 5 of the GPT preprint.

    Section 4.3 of my paper carries out the construction which combines many components including S and T. This corresponds to Sections 2, 3, 4, and 6 of the GPT preprint. This section has a lot of moving parts; I give an outline in Section 4.3.1.

    In Section 4.3.2, I describe how the different components will be combined, using a construction which I call the disjoint union, and introduce generating functions \mathcal{F}_A(z) as a bookkeeping tool to keep track of the sumset sizes of a set A. This corresponds to Section 2 and Section 4 of the GPT preprint.

    In Section 4.3.3, I compute the generating function of each of the component sets, including \mathcal{F}_S(z) (Lemma 4.15) and \mathcal{F}_T(z) (Lemma 4.17). This corresponds to Section 3 and Section 6.1 of the GPT preprint. In particular, \mathcal{F}_{G}(z) is computed in Lemma 3.3 and \mathcal{F}_{H}(z) is computed in Lemma 3.4. Once these generating functions have been computed, the remainder of the proof is almost identical in my paper and in the GPT preprint.

    In Section 4.3.4, I put all the pieces together to show that as we range over the sets A which I have constructed, the values of |hA| will assume all of the elements of {\lceil\varepsilon k^h\rceil, \lceil\varepsilon k^h\rceil+1,\ldots ,\binom{h+k-1}{h} }. The key idea is to show that the set of all values of |hA| forms an interval, and contains numbers both smaller than \varepsilon k^h and equal to \binom{h+k-1}{h}.

    http://gowers.wordpress.com/?p=6666
    Extensions
    Group and semigroup puzzles and a possible Polymath project
    Mathematics on the internetpolymathmathematics
    An Artin-Tits group is a group with a finite set of generators in which every relation is of the form or for some positive integer , where and are two of the generators. In particular, the commutation relation is allowed (it is the case of the first type of relation) and so is the braid […]
    Show full content

    An Artin-Tits group is a group with a finite set of generators a_1,\dots a_k in which every relation is of the form (ab)^r=(ba)^r or (ab)^ra=(ba)^rb for some positive integer r, where a and b are two of the generators. In particular, the commutation relation ab=ba is allowed (it is the case r=1 of the first type of relation) and so is the braid relation aba=bab (it is the case r=1 of the second type of relation). This means that Artin-Tits groups include free groups, free Abelian groups, and braid groups: for example, the braid group on k+1 strands has a presentation with generators a_1,\dots,a_k, where a_i represents a twist of the ith and (i+1)st strands, and relations a_ia_j=a_ja_i if |i-j|>1 and a_ia_{i+1}a_i=a_{i+1}a_ia_{i+1}.

    A few weeks ago, I asked ChatGPT for a simple example of a word problem for groups or semigroups that was not known to be decidable and also not known to be undecidable. It turns out that the word problem for many Artin-Tits groups comes into that category: the simplest example where the status is not known is the group with four generators a,b,c and d where c and d commute and all other pairs of generators satisfy the braid relation xyx=yxy.

    My interest in this was initially that I was looking for a toy model from which one could learn something about how mathematicians judge theorems to be interesting. I started with a remarkable semigroup discovered by G. S. Tseytin that has five generators, seven very simple relations, and an undecidable word problem. From that I created a puzzle game that you might be interested to play. (NB the puzzles were not showing up in the public version, but that should be sorted out now.) Each puzzle is equivalent to an instance of the word problem for Tseytin’s semigroup, but the interface makes it much more convenient to change words using the relations than it would be to do it with pen and paper.

    I was hoping that because every mathematical problem can in principle be encoded as a puzzle in this game, one might be able to build a sort of “alien mathematics”, where a theorem was an equality between words, and a definition was a decision to introduce a new (and redundant) generator g, together with a relation of the form g = w, where w is a word in the current alphabet. Theorems would be particularly interesting if they were equalities between words that could be established only by a chain of equalities that went through much longer words, and definitions would be useful if the new generators satisfied particularly concise relations (which would allow one to build “theories” within the system). I still hope to find a word problem that will allow such a project to take off, but in the end, after a lot of playing around with the game linked to above, I have decided that Tseytin’s semigroup is not suitable. The reason is that it is designed so that arbitrary instances of the word problem for groups can be encoded as word problems in this semigroup, and once one gets used to the game, one starts to see how that encoding can work. Furthermore, one seems to be driven towards the encoding — I don’t get the impression that there’s a whole other region of this semigroup to explore that has nothing to do with the kinds of words that come up in the encoding. And if that impression is correct, then one might as well start with the word problem for some group in unencoded form, or alternatively look for another semigroup. Nevertheless, I find the Tseytin game quite enjoyable: I won’t say more about it here but have written a fairly comprehensive tutorial that you can open up and read if you follow the link above.

    This is perhaps the moment to say that the words “I created a puzzle game” are slightly misleading. For one thing, I discussed the idea of gamifying Tseytin’s semigroup about three years ago with Mirek Olšák, a former member of my automatic theorem proving group, and he created a basic prototype in Python. But the main point is that I do not have the programming skills to create a game that can be played in a web browser — I vibecoded it using ChatGPT.

    After that experience, I thought that maybe I would have better luck if instead of looking for a group or semigroup with undecidable word problem, which might well have been explicitly designed with some encoding in mind, I looked for a word problem for which the decidability status was unknown. That way, it wouldn’t have been designed to be undecidable, but might nevertheless just happen to be undecidable and provide a nice playground of the kind I was (and still am) after. And that is what led me to the Artin-Tits groups.

    However, those don’t seem to be suitable either, because it is conjectured that they all have decidable word problems. I have created a game for the Artin-Tits group mentioned above, which you can also play if you want, but I have found it very hard to create interesting puzzles. (NB again there was a problem with the puzzles showing up, with only a tiny handful being there, but that should now be sorted out.) That is, I found it difficult to find words that are equivalent to the identity but that are not easily shown to be equivalent to the identity. One nice example comes from the fact that the subgroup generated by a,c and d is isomorphic to the braid group B_4 with four strands. The late Patrick Dehornoy found a very nice example of a braid with four strands that is equal to the identity but not in a completely trivial way. A picture of it can be found on page 4 of this paper.

    This is where a potential Polymath project comes in. An initial goal would be to determine the decidability status of this one small Artin-Tits group: if we managed that, then we could consider the more general problem. And the way I envisage approaching this initial goal is an iterative process that runs as follows.

    1. Devise an algorithm A_0 for solving word problems in the group.
    2. If the current algorithm is A_i, then search for a puzzle P that A_i fails to solve.
    3. If the search is successful, then devise an algorithm A_{i+1} that solves all the puzzles that A_i solves, and also solves P. Make A_{i+1} the current algorithm and return to step 2. Devising A_{i+1} can be done by playing the game with puzzle P several times until one has the feeling that one has solved it in a systematic way.
    4. If the search is unsuccessful and the current algorithm is A, then attempt to prove inductively that A always succeeds: that is, to prove that if A solves P and Q is obtained from P by applying a single relation in the group, then A solves Q.
    5. (Maybe) If the inductive step doesn’t seem to work, then try to use that failure to devise a puzzle that defeats A and return to step 3.

    I have done a couple of iterations already, with the result that I now have an algorithm — let’s call it A_0 — that solves (basically instantaneously) every puzzle I throw at it, including the one derived from Dehornoy’s braid mentioned above. So a subgoal of the initial goal is to find a puzzle that has a solution but that A_0 doesn’t solve. If we can’t, then maybe it is worth trying to prove that A_0 solves the word problem for this group. One comment about the algorithm A_0 is that it never increases the length of a word, though it often does moves that preserve the length. I was told by ChatGPT that it is an unsolved problem whether every braid that is equal to the identity can be reduced to the identity without ever increasing the number of crossings. Of course that isn’t a guarantee that it really is unsolved, but if it is, then that’s another interesting problem. It also means that I would be extremely interested if someone found an example of a word in the Artin-Tits group that can be reduced to the identity but only if one starts by lengthening it.

    I’m hoping that this will be an enjoyable project for people who like both mathematics and programming. On the maths side, there is an unsolved problem to think about, and on the programming side there are lots of possibilities: for example, one could write programs that explore the space of words equal to the identity, trying to do so in such a way that there is a reasonable chance of reaching a word where it isn’t obvious how to reverse all the steps and get back to the identity. The game comes with a “sandbox” that has a few tools for generating words, but at the moment it is fairly primitive, and I would welcome suggestions for how to improve it.

    It seems to me that as Polymath projects go, one can’t really lose: either we find an algorithm for the word problem, which establishes an unknown (at least according to ChatGPT) case of a decidability problem, or we find a suite of harder and harder puzzles, creating a more and more challenging and entertaining game and obtaining a deeper understanding of the Artin-Tits group in the process.

    This Artin-Tits game has only a rather rudimentary and not very good tutorial created by ChatGPT. It’s easy to play once you’ve got the hang of it, and in the end I think the easiest way to get the hang of it is to watch someone else play it for a couple of minutes. I have therefore created a video tutorial, which you can find here. The video itself lasts about 25 minutes, but the tutorial part is under 10 minutes: what makes the video longer is an explanation of various features for making it easier to create puzzles, and also an explanation of how the algorithm works, which again is much easier to explain if I can demonstrate it on screen than if I have to write down some text. (I do have some text, since that is what I used as a prompt for ChatGPT to implement the algorithm, but it took a few iterations to get it working properly, so now I’m not sure what the quality of the code will be like, or even whether it is doing exactly what I want, though it appears to be.) Although I have embedded the video into this post, if you actually want to see what is going on you will probably need to watch it on YouTube using the full screen. I recommend not watching the explanation of the algorithm until you have played the game a few times first. Also, I should warn you that the games in the Advanced category are not all soluble, but games 1, 5 and 6 definitely are. Another thing I forgot to say on the video is that if you want you can “rotate” a word by clicking on one of its letters and dragging it to the right or left. If the word is equal to the identity, then its rotations will be as well, so this is a valid move. There is also a button for disabling this rotation facility if you want the puzzle to be slightly more challenging.

    A quick note about the availability of the two games. They are hosted on the Netlify platform. I am on their free plan, which gives me a certain number of “credits” each month. I’m not sure how quickly these will run out, largely because I have no idea how many people will actually think it interesting to play the games. If they do run out, then the games will cease to be available until the credits are renewed, which for me happens on the 10th of each month. If this has happened and you are keen to play one of the games, another option is to download the html files and open them in your browser. Here is a link to the Artin-Tits game and here is one to the Tseytin game. If you are feeling particularly public-spirited, especially if you think you will play quite a lot, then you might consider doing that anyway, so that the Netlify credits run down more slowly. If the running out of the credits is quick enough for all this to be a real issue, then I may move to a paid plan.

    I’ll finish with a quick tip for playing the Artin-Tits game, which I mentioned on the video but perhaps didn’t stress enough. Many of the moves consist in selecting three consecutive letters and using a relation of the form xyx=yxy to change them. Easy examples of this are replacement rules such as aba\to bab. But what if inverses are involved? I’ll represent inverses of generators with upper-case letters, so for example abA represents aba^{-1}, which in the game would be a white a followed by a white b followed by a black a. The word abA turns out to equal Bab. To remember this, a simple rule is that two letters of the same colour can be bracketed together and “pushed past” the third letter, which retains its colour but changes its value. Here, for example, we write abA as (ab)A and then swap them over, changing A into B in the process, getting B(ab), or Bab without the brackets. In group theoretic terms, this is of course saying that A and B are conjugates, and that ab is what conjugates one to the other. But when playing the game it is convenient to remember it by thinking that when you see a subword such as bDB, you can push the DB (and in particular the D) to the left, getting DBd.

    http://gowers.wordpress.com/?p=6642
    Extensions
    Creating a database of motivated proofs
    ComputingDemystifying proofsNewsaiautomatic-theorem-provingmathematics
    It’s been over three years since my last post on this blog and I have sometimes been asked, understandably, whether the project I announced in my previous post was actually happening. The answer is yes — the grant I received from the Astera Institute has funded several PhD students and a couple of postdocs, and […]
    Show full content

    It’s been over three years since my last post on this blog and I have sometimes been asked, understandably, whether the project I announced in my previous post was actually happening. The answer is yes — the grant I received from the Astera Institute has funded several PhD students and a couple of postdocs, and we have been busy. In my previous post I suggested that I would be open to remote collaboration, but that has happened much less, partly because a Polymath-style approach would have been difficult to manage while also ensuring that my PhD students would have work that they could call their own to put in their theses.

    In general I don’t see a satisfactory solution to that problem, but in this post I want to mention a subproject of the main project that is very much intended to be a large public collaboration. A few months ago, a call came out from Renaissance Philanthropies saying that they were launching a $9m AI for Math Fund to spend on projects in the general sphere of AI and mathematics, and inviting proposals. One of the categories that they specifically mentioned was creating new databases, and my group submitted a proposal to create a database of what we call “structured motivated proofs,” a piece of terminology that I will explain a bit more later in just a moment. I am happy to report that our proposal was one of the 29 successful ones. Since a good outcome to the project will depend on collaboration from many people outside the group, we need to publicize it, which is precisely the purpose of this post. Below I will be more specific about the kind of help we are looking for.

    Why might yet another database of theorems and proofs be useful?

    The underlying thought behind this project is that AI for mathematics is being held back not so much by an insufficient quantity of data as by the wrong kind of data. (For a more general exploration of this theme, see here.) All mathematicians know, and some of us enjoy complaining about it, that it is common practice when presenting a proof in a mathematics paper, or even textbook, to hide the thought processes that led to the proof. Often this does not matter too much, because the thought processes may be standard ones that do not need to be spelt out to the intended audience. But when proofs start to get longer and more difficult, they can be hard to read because one has to absorb definitions and lemma statements that are not obviously useful, are presented as if they appeared from nowhere, and demonstrate their utility only much later in the argument.

    A sign that this is a problem for AI is the behaviour one observes after asking an LLM to prove a statement that is too difficult for it. Very often, instead of admitting defeat, it will imitate the style of a typical mathematics paper and produce rabbits out of hats, together with arguments later on that those rabbits do the required job. The problem is that, unlike with a correct mathematics paper, one finds when one scrutinizes the arguments carefully that they are wrong. However, it is hard to find superficial features that distinguish between an incorrect rabbit with an incorrect argument justifying that rabbit (especially if the argument does not go into full detail) and a correct one, so the kinds of statistical methods used by LLMs do not have an easy way to penalize the incorrectness.

    Of course, that does not mean that LLMs cannot do mathematics at all — they are remarkably good at it, at least compared with what I would have expected three years ago. How can that be, given the problem I have discussed in the previous paragraph?

    The way I see it (which could change — things move so fast in this sphere), the data that is currently available to train LLMs and other systems is very suitable for a certain way of doing mathematics that I call guess and check. When trying to solve a maths problem, you will normally write down the routine parts of an argument without any fuss (and an LLM can do them too because it has seen plenty of similar examples), but if the problem as a whole is not routine, then at some point you have to stop and think, often because you need to construct an object that has certain properties (I mean this in a rather general way — the “object” might be a lemma that will split up the proof in a nice way) and it is not obvious how to do so. The guess-and-check approach to such moments is what it says: you make as intelligent a guess as you can and then see whether it has the properties you wanted. If it doesn’t, you make another guess, and you keep going until you get lucky.

    The reason an LLM might be tempted to use this kind of approach is that the style of mathematical writing I described above makes it look as though that is what we as mathematicians do. Of course, we don’t actually do that, but we tend not to mention all the failed guesses we made and how we carefully examined why they failed, modifying them in appropriate ways in response, until we finally converged on an object that worked. We also don’t mention the reasoning that often takes place before we make the guess, saying to ourselves things like “Clearly an Abelian group can’t have that property, so I need to look for a non-Abelian group.”

    Intelligent guess and check works well a lot of the time, particularly when carried out by an LLM that has seen many proofs of many theorems. I have often been surprised when I have asked an LLM a problem of the form \exists x\in X \ P(x), where P is some property that is hard to satisfy, and the LLM has had no trouble answering it. But somehow when this happens, the flavour of the answer given by the LLM leaves me with the impression that the technique it has used to construct x is one that it has seen before and regards as standard.

    If the above picture of what LLMs can do is correct (the considerations for reinforcement-learning-based systems such as AlphaProof are not identical but I think that much of what I say in this post applies to them too for slightly different reasons), then the likely consequence is that if we pursue current approaches, then we will reach a plateau: broadly speaking they will be very good at answering a question if it is the kind of question that a mathematician with the right domain expertise and good instincts would find reasonably straightforward, but will struggle with anything that is not of that kind. In particular, they will struggle with research-level problems, which are, almost by definition, problems that experts in the area do not find straightforward. (Of course, there would probably be cases where an LLM spots relatively easy arguments that the experts had missed, but that wouldn’t fundamentally alter the fact that they weren’t really capable of doing research-level mathematics.)

    But what if we had a database of theorems and proofs that did not hide the thought processes that lay behind the non-obvious details of the proofs? If we could train AI on a database of accounts of proof discoveries and if, having done so, we then asked it to provide similar accounts, then it would no longer resort to guess-and-check when it got stuck, because the proof-discovery accounts it had been trained on would not be resorting to it. There could be a problem getting it to unlearn its bad habits, but I don’t think that difficulty would be impossible to surmount.

    The next question is what such a database might look like. One could just invite people to send in stream-of-consciousness accounts of how they themselves found certain proofs, but that option is unsatisfactory for several reasons.

    1. It can be very hard to remember where an idea came from, even a few seconds after one has had it — in that respect it is like a dream, the memory of which becomes rapidly less vivid as one wakes up.
    2. Often an idea will seem fairly obvious to one person but not to another.
    3. The phrase “motivated proof” means different things to different people, so without a lot of careful moderation and curation of entries, there is a risk that a database would be disorganized and not much more helpful than a database of conventionally written proofs.
    4. A stream-of-consciousness account could end up being a bit too much about the person who finds the proof and not enough about the mathematical reasons for the proof being feasibly discoverable.

    To deal with these kinds of difficulties, we plan to introduce a notion of a structured motivated proof, by which we mean a proof that is generated in a very particular way that I will partially describe below. A major part of the project, and part of the reason we needed funding for it, is to create a platform that will make it convenient to input structured motivated proofs and difficult to insert the kinds of rabbits out of hats that make a proof mysterious and unmotivated. In this way we hope to gamify the task of creating the database, challenging people to input into our system proofs of certain theorems that appear to rely on “magic” ideas, and perhaps even offering prizes for proofs that contain steps that appear in advance to be particularly hard to motivate. (An example: the solution by Ellenberg and Gijswijt of the cap-set problem uses polynomials in a magic-seeming way. The idea of using polynomials came from an earlier paper of Croot, Lev and Pach that proved a closely related theorem, but in that paper it just appears in the statement of their Lemma 1, with no prior discussion apart from the words “in the present paper we use the polynomial method” in the introduction.)

    What is a structured motivated proof?

    I wrote about motivated proofs in my previous post, but thanks to many discussions with other members of the group, my ideas have developed quite a lot since then. Here are two ways we like to think about the concept.

    1. A structured motivated proof is one that is generated by standard moves.

    I will not go into full detail about what I mean by this, but will do so in a future post when we have created the platform that we would like people to use in order to input proofs into the database. But the basic idea is that at any one moment one is in a certain state, which we call a proof-discovery state, and there will be a set of possible moves that can take one from the current proof-discovery state to a new one.

    A proof-discovery state is supposed to be a more formal representation of the state one is in when in the middle of solving a problem. Typically, if the problem is difficult, one will have asked a number of questions, and will be aware of logical relationships between them: for example, one might know that a positive answer to Q1 could be used to create a counterexample to Q2, or that Q3 is a special case of Q4, and so on. One will also have proved some results connected with the original question, and again these results will be related to each other and to the original problem in various ways that might be quite complicated: for example P1 might be a special case of Q2, which, if true would reduce Q3 to Q4, where Q3 is a generalization of the statement we are trying to prove.

    Typically we will be focusing on one of the questions, and typically that question will take the form of some hypotheses and a target (the question being whether the hypotheses imply the target). One kind of move we might make is a standard logical move such as forwards or backwards reasoning: for example, if we have hypotheses of the form P(x) and \forall u\ P(u)\implies Q(u), then we might decide to deduce Q(x). But things get more interesting when we consider slightly less basic actions we might take. Here are three examples.

    1. We have in our list of hypotheses the fact that a function f is given by the formula f(x)=\exp(p(x)), where p is a polynomial, and our goal is to prove that there exists z such that f(z)=1. Without really thinking about it, we are conscious that f is a composition of two functions, one of which is continuous and one of which belongs to a class of functions that are all continuous, so f is continuous. Also, the conclusion \exists z\ f(z)=1 matches well the conclusion of the intermediate-value theorem. So the intermediate-value theorem comes naturally to mind and we add it to our list of available hypotheses. In practice we wouldn’t necessarily write it down, but the system we wish to develop is intended to model not just what we write down but also what is going on in our brains, so we propose a move that we call library extraction (closely related to what is often called premise selection in the literature). Note that we have to be a bit careful about library extraction. We don’t want the system to be allowed to call up results from the library that appear to be irrelevant but then magically turn out to be helpful, since those would feel like rabbits out of hats. So we want to allow extraction of results only if they are obvious given the context. It is not easy to define what “obvious” means, but there is a good rule of thumb for it: a library extraction is obvious if it is one of the first things ChatGPT thinks of when given a suitable non-cheating prompt. For example, I gave it the prompt, “I have a function f from the reals to the reals and I want to prove that there exists some z such that f(z)=1. Can you suggest any results that might be helpful?” and the intermediate-value theorem was its second suggestion. (Note that I had not even told it that f was continuous, so I did not need to make that particular observation before coming up with the prompt.)
    2. We have a goal of the form \exists x\in X\ P(x). If this were a Lean proof state, the most common way to discharge a goal of this form would be to input a choice for x. That is, we would instantiate the existential quantifier with some x_0 and our new goal would be P(x_0). However, as with library extraction, we have to be very careful about instantiation if we want our proof to be motivated, since we wish to disallow highly surprising choices of x_0 that can be found only after a long process of thought. So we have to restrict ourselves to obvious instantiations. One way that an instantiation in our system will count as obvious is if the variable is instantiated with a term that is already present in the proof-discovery state. If the desired term is not present, then in order to continue with the proof, it will be necessary to carry out moves that generate it. A very common technique for this is the use of metavariables: instead of guessing a suitable x_0, we create a variable x^\bullet and change the goal to P(x^\bullet), which we can think of as saying “I’m going to start trying to prove P(x^\bullet) even though I haven’t chosen x^\bullet yet. As the attempted proof proceeds, I will note down any properties Q_1,\dots,Q_k that x^\bullet might have that would help me finish the proof, in the hope that (i) I get to the end and (ii) the problem \exists x\ Q_1(x)\wedge\dots\wedge Q_k(x) is easier than the original problem.” Another kind of obvious instantiation is one where we try out an object that is “extreme” in some way — it might be the smallest element of X, or the largest, or the simplest. (Judging simplicity is another place where the ChatGPT rule of thumb can be used.)
    3. We cannot see how to answer the question we are focusing on so we ask a related question. Two very common kinds of related question (as emphasized by Polya) are generalization and specialization. Perhaps we don’t see why a hypothesis is helpful, so we see whether the result holds if we drop that hypothesis. If it does, then we are no longer distracted by an irrelevant hypothesis. If it does not, then we can hope to find a counterexample that will help us understand how to use the hypothesis. Or perhaps we are trying to prove a general statement but it is not clear how to do so, so instead we formulate some special cases, hoping that we can prove them and spot features of the proofs that we can generalize. Again we have to be rather careful here not to allow “non-obvious” generalizations and specializations. Roughly the idea there is that a generalization should be purely logical — for example, dropping a hypothesis is fine but replacing the hypothesis “f is twice differentiable” by “f is upper semicontinuous” is not — and that a specialization should be to a special case that counts as an obvious instantiation in the sense discussed just above.
    2. A structured motivated proof is one that can be generated with the help of a point-and-click system.

    This is a surprisingly useful way to conceive of what we are talking about, especially as it relates closely to what I was talking about earlier: imposing a standard form on motivated proofs (which is why we call them “structured” motivated proofs) and gamifying the process of producing them.

    The idea is that a structured motivated proof is one that can be generated using an interface (which we are in the process of creating — at the moment we have a very basic prototype that has a few of the features we will need, but not yet the more interesting ones) that has one essential property: the user cannot type in data. So what can they do? They can select text that is on their screen (typically mathematical expressions or subexpressions), they can click buttons, choose items from drop-down menus, and accept or reject “obvious” suggestions made to them by the interface.

    If, for example, the current goal is an existential statement \exists x\ P(x), then typing in a formula that defines a suitable x is not possible, so instead one must select text or generate new text by clicking buttons, choosing from short drop-down menus, and so on. This forces the user to generate x, which is our proxy for showing where the idea of using x came from.

    Broadly speaking, the way the prototype works is to get an LLM to read a JSON object that describes the variables, hypotheses and goals involved in the proof state in a structured format, and to describe (by means of a fairly long prompt) the various moves it might be called upon to do. Thus, the proofs generated by the system are not formally verified, but that is not an issue that concerns us in practice since there will be a human in the loop throughout to catch any mistakes that the LLM might make, and this flexibility may even work to our advantage to better capture the fluidity of natural-language mathematics.

    There is obviously a lot more to say about what the proof-generating moves are, or (approximately equivalently) what the options provided by a point-and-click system will be. I plan to discuss that in much more detail when we are closer to having an interface ready, the target for which is the end of this calendar year. But the aim of the project is to create a database of examples of proofs that have been successfully generated using the interface, which can then be used to train AI to play the generate-structured-motivated-proof game.

    How to get involved.

    There are several tasks that will need doing once the project gets properly under way. Here are some of the likely ones.

    1. The most important is for people to submit structured motivated (or move-generated) proofs to us on the platform we provide. We hope that the database will end up containing proofs of a wide range of difficulty (of two kinds — there might be fairly easy arguments that are hard to motivate and there might be arguments that are harder to follow but easier to motivate) and also a wide range of areas of mathematics. Our initial target, which is quite ambitious, is to have around 1000 entries by two years from now. While we are not in a position to accept entries yet, if you are interested in participating, then it is not too early to start thinking in a less formal way about how to convert some of your favourite proofs into motivated versions, since that will undoubtedly make it easier to get them accepted by our platform when it is ready.
    2. We are in the process of designing the platform. As I mentioned earlier, we already have a prototype, but there are many moves we will need it to be able to do that it cannot currently do. For example, the current prototype allows just a single proof state, which consists of some variable declarations, hypotheses, and goals. It does not yet support creating subsidiary proof states (which we would need if we wanted to allow the user to consider generalizations and specializations, for example). Also, for the moment the prototype gets an LLM to implement all moves, but some of the moves, such as applying modus ponens, are extremely mechanical and would be better done using a conventional program. (On the other hand, moves such as “obvious library extraction” or “provide the simplest example” are better done by an LLM.) Thirdly, a technical problem is that LaTeX is currently rendered as images, which makes it hard to select subexpressions, something we will need to be able to do in a non-clunky way. And the public version of the platform will need to be web-based and very convenient to use. We will want features such as being able to zoom out and look at some kind of dependency diagram of all the statements and questions currently in play, and then zoom in on various nodes if the user wishes to work on them. If you think you may be able (and willing) to help with some of these aspects of the platform, then we would be very happy to hear from you. For some, it would probably help to have a familiarity with proof assistants, while for others we would be looking for somebody with software engineering experience. The grant from the AI for Math Fund will allow us to pay for some of this help, at rates to be negotiated. We are not yet ready to specify in detail what help we need, but would welcome any initial expressions of interest.
    3. Once the platform is ready and people start to submit proofs, it is likely that, at least to start with, they will find that the platform does not always provide the moves they need. Perhaps they will have a very convincing account of where a non-obvious idea in the proof came from, but the system won’t be expressive enough for them to translate that account into a sequence of proof-generating moves. We will want to be able to react to such situations (if we agree that a new move is needed) by expanding the capacity of the platform. It will therefore be very helpful if people sign up to be beta-testers, so that we can try to get the platform to a reasonably stable state before opening it up to a wider public. Of course, to be a beta-tester you would need to have a few motivated proofs in mind.
    4. It is not obvious that every proof submitted via the platform, even if submitted successfully, would be a useful addition to the database. For instance, it might be such a routine argument that no idea really needs to have its origin explained. Or it might be that, despite our best efforts, somebody finds a way of sneaking in a rabbit while using only the moves that we have provided. (One way this could happen is if an LLM made a highly non-obvious suggestion that happened to work, in which case the rule of thumb that if an LLM thinks of it, it must be obvious, would have failed in that instance.) For this reason, we envisage having a team of moderators, who will check entries and make sure that they are good additions to the database. We hope that this will be an enjoyable task, but it may have its tedious aspects, so we envisage paying moderators — again, this expense was allowed for in our proposal to the AI for Math Fund.

    If you think you might be interested in any of these roles, please feel free to get in touch. Probably the hardest recruitment task for us will be identifying the right people with the right mixture of mathematical knowledge and software engineering skills to help us turn the platform into a well-designed web-based one that is convenient and pleasurable to use. If you think you might be such a person, or if you have a good idea for how we should go about finding one, we would be particularly interested to hear from you.

    In a future post, I will say more about the kinds of moves that our platform will allow, and will give examples of non-motivated proofs together with how motivated versions of those proofs can be found and entered using the platform (which may involve a certain amount of speculation about what the platform will end up looking like).

    How does this relate to use of tactics in a proof assistant?

    In one way, our “moves” can be regarded as tactics of a kind. However, some of the moves we will need are difficult to implement in conventional proof assistants such as Lean. In parallel with the work described above, we hope to create an interface to Lean that would allow one to carry out proof-discovery moves of the kind discussed above but with the proof-discovery states being collections of Lean proof states. Members of my group have already been working on this and have made some very interesting progress, but there is some way to go. However, we hope that at some point (and this is also part of the project pitched to the AI for Math Fund) we will have created another interface that will have Lean working in the background, so that it will be possible to generate motivated proofs that will be (or perhaps it is better to say include) proofs in Lean at the same time.

    Another possibility that we are also considering is to use the output of the first platform (which, as mentioned above, will be fairly formal, but not in the strict sense of a language such as Lean) to create a kind of blueprint that can then be autoformalized automatically. Then we would have a platform that would in principle allow mathematicians to search for proofs while working on their computers without having to learn a formal language, with their thoughts being formalized as they go.

    http://gowers.wordpress.com/?p=6580
    Extensions
    Announcing an automatic theorem proving project
    complexityComputingDemystifying proofsNewspolymath
    I am very happy to say that I have recently received a generous grant from the Astera Institute to set up a small group to work on automatic theorem proving, in the first instance for about three years after which we will take stock and see whether it is worth continuing. This will enable me […]
    Show full content

    I am very happy to say that I have recently received a generous grant from the Astera Institute to set up a small group to work on automatic theorem proving, in the first instance for about three years after which we will take stock and see whether it is worth continuing. This will enable me to take on up to about three PhD students and two postdocs over the next couple of years. I am imagining that two of the PhD students will start next October and that at least one of the postdocs will start as soon as is convenient for them. Before any of these positions are advertised, I welcome any informal expressions of interest: in the first instance you should email me, and maybe I will set up Zoom meetings. (I have no idea what the level of demand is likely to be, so exactly how I respond to emails of this kind will depend on how many of them there are.)

    I have privately let a few people know about this, and as a result I know of a handful of people who are already in Cambridge and are keen to participate. So I am expecting the core team working on the project to consist of 6-10 people. But I also plan to work in as open a way as possible, in the hope that people who want to can participate in the project remotely even if they are not part of the group that is based physically in Cambridge. Thus, part of the plan is to report regularly and publicly on what we are thinking about, what problems, both technical and more fundamental, are holding us up, and what progress we make. Also, my plan at this stage is that any software associated with the project will be open source, and that if people want to use ideas generated by the project to incorporate into their own theorem-proving programs, I will very much welcome that.

    I have written a 54-page document that explains in considerable detail what the aims and approach of the project will be. I would urge anyone who thinks they might want to apply for one of the positions to read it first — not necessarily every single page, but enough to get a proper understanding of what the project is about. Here I will explain much more briefly what it will be trying to do, and what will set it apart from various other enterprises that are going on at the moment.

    In brief, the approach taken will be what is often referred to as a GOFAI approach, where GOFAI stands for “good old-fashioned artificial intelligence”. Roughly speaking, the GOFAI approach to artificial intelligence is to try to understand as well as possible how humans achieve a particular task, and eventually reach a level of understanding that enables one to program a computer to do the same.

    As the phrase “old-fashioned” suggests, GOFAI has fallen out of favour in recent years, and some of the reasons for that are good ones. One reason is that after initial optimism, progress with that approach stalled in many domains of AI. Another is that with the rise of machine learning it has become clear that for many tasks, especially pattern-recognition tasks, it is possible to program a computer to do them very well without having a good understanding of how humans do them. For example, we may find it very difficult to write down a set of rules that distinguishes between an array of pixels that represents a dog and an array of pixels that represents a cat, but we can still train a neural network to do the job.

    However, while machine learning has made huge strides in many domains, it still has several areas of weakness that are very important when one is doing mathematics. Here are a few of them.

    1. In general, tasks that involve reasoning in an essential way.
    2. Learning to do one task and then using that ability to do another.
    3. Learning based on just a small number of examples.
    4. Common sense reasoning.
    5. Anything that involves genuine understanding (even if it may be hard to give a precise definition of what understanding is) as opposed to sophisticated mimicry.

    Obviously, researchers in machine learning are working in all these areas, and there may well be progress over the next few years [in fact, there has been progress on some of these difficulties already of which I was unaware — see some of the comments below], but for the time being there are still significant limitations to what machine learning can do. (Two people who have written very interestingly on these limitations are Melanie Mitchell and François Chollet.)

    That said, using machine learning techniques in automatic theorem proving is a very active area of research at the moment. (Two names you might like to look up if you want to find out about this area are Christian Szegedy and Josef Urban.) The project I am starting will not be a machine-learning project, but I think there is plenty of potential for combining machine learning with GOFAI ideas — for example, one might use GOFAI to reduce the options for what the computer will do next to a small set and use machine learning to choose the option out of that small set — so I do not rule out some kind of wider collaboration once the project has got going.

    Another area that is thriving at the moment is formalization. Over the last few years, several major theorems and definitions have been fully formalized that would have previously seemed out of reach — examples include Gödel’s theorem, the four-colour theorem, Hales’s proof of the Kepler conjecture, Thompson’s odd-order theorem, and a lemma of Dustin Clausen and Peter Scholze with a proof that was too complicated for them to be able to feel fully confident that it was correct. That last formalization was carried out in Lean by a team led by Johan Commelin, which is part of the more general and exciting Lean group that grew out of Kevin Buzzard‘s decision a few years ago to incorporate Lean formalization into his teaching at Imperial College London.

    As with machine learning, I mention formalization in order to contrast it with the project I am announcing here. It may seem slightly negative to focus on what it will not be doing, but I feel it is important, because I do not want to attract applications from people who have an incorrect picture of what they would be doing. Also as with machine learning, I would welcome and even expect collaboration with the Lean group. For us it would be potentially very interesting to make use of the Lean database of results, and it would also be nice (even if not essential) to have output that is formalized using a standard system. And we might be able to contribute to the Lean enterprise by creating code that performs steps automatically that are currently done by hand. A very interesting looking new institute, the Hoskinson Center for Formal Mathematics, has recently been set up with Jeremy Avigad at the helm, which will almost certainly make such collaborations easier.

    But now let me turn to the kinds of things I hope this project will do.

    Why is mathematics easy?

    Ever since Turing, we have known that there is no algorithm that will take as input a mathematical statement and output a proof if the statement has a proof or the words “this statement does not have a proof” otherwise. (If such an algorithm existed, one could feed it statements of the form “algorithm A halts” and the halting problem would be solved.) If P\ne NP, then there is not even a practical algorithm for determining whether a statement has a proof of at most some given length — a brute-force algorithm exists, but takes far too long. Despite this, mathematicians regularly find long and complicated proofs of theorems. How is this possible?

    The broad answer is that while the theoretical results just alluded to show that we cannot expect to determine the proof status of arbitrary mathematical statements, that is not what we try to do. Rather, we look at only a tiny fraction of well-formed statements, and the kinds of proofs we find tend to have a lot more structure than is guaranteed by the formal definition of a proof as a sequence of statements, each of which is either an initial assumption or follows in a simple way from earlier statements. (It is interesting to speculate about whether there are, out there, utterly bizarre and idea-free proofs that just happen to establish concise mathematical statements but that will never be discovered because searching for them would take too long.) A good way of thinking about this project is that it will be focused on the following question.

    Question. What is it about the proofs that mathematicians actually find that makes them findable in a reasonable amount of time?

    Clearly, a good answer to this question would be extremely useful for the purposes of writing automatic theorem proving programs. Equally, any advances in a GOFAI approach to writing automatic theorem proving programs have the potential to feed into an answer to the question. I don’t have strong views about the right balance between the theoretical and practical sides of the project, but I do feel strongly that both sides should be major components of it.

    The practical side of the project will, at least to start with, be focused on devising algorithms that find proofs in a way that imitates as closely as possible how humans find them. One important aspect of this is that I will not be satisfied with programs that find proofs after carrying out large searches, even if those searches are small enough to be feasible. More precisely, searches will be strongly discouraged unless human mathematicians would also need to do them. A question that is closely related to the question above is the following, which all researchers in automatic theorem proving have to grapple with.

    Question. Humans seem to be able to find proofs with a remarkably small amount of backtracking. How do they prune the search tree to such an extent?

    Allowing a program to carry out searches of “silly” options that humans would never do is running away from this absolutely central question.

    With Mohan Ganesalingam, Ed Ayers and Bhavik Mehta (but not simultaneously), I have over the years worked on writing theorem-proving programs with as little search as possible. This will provide a starting point for the project. One of the reasons I am excited to have the chance to set up a group is that I have felt for a long time that with more people working on the project, there is a chance of much more rapid progress — I think the progress will scale up more than linearly in the number of people, at least up to a certain size. And if others were involved round the world, I don’t think it is unreasonable to hope that within a few years there could be theorem-proving programs that were genuinely useful — not necessarily at a research level but at least at the level of a first-year undergraduate. (To be useful a program does not have to be able to solve every problem put in front of it: even a program that could solve only fairly easy problems but in a sufficiently human way that it could explain how it came up with its proofs could be a very valuable educational tool.)

    A more distant dream is of course to get automatic theorem provers to the point where they can solve genuinely difficult problems. Something else that I would like to see coming out of this project is a serious study of how humans do this. From time to time I have looked at specific proofs that appear to require at a certain point an idea that comes out of nowhere, and after thinking very hard about them I have eventually managed to present a plausible account of how somebody might have had the idea, which I think of as a “justified proof”. I would love it if there were a large collection of such accounts, and I have it in mind as a possible idea to set up (with help) a repository for them, though I would need to think rather hard about how best to do it. One of the difficulties is that whereas there is widespread agreement about what constitutes a proof, there is not such a clear consensus about what constitutes a convincing explanation of where an idea comes from. Another theoretical problem that interests me a lot is the following.

    Problem. Come up with a precise definition of a “proof justification”.

    Though I do not have a satisfactory definition, very recently I have had some ideas that will I think help to narrow down the search substantially. I am writing these ideas down and hope to make them public soon.

    Who am I looking for?

    There is much more I could say about the project, but if this whets your appetite, then I refer you to the document where I have said much more about it. For the rest of this post I will say a little bit more about the kind of person I am looking for and how a typical week might be spent.

    The most important quality I am looking for in an applicant for a PhD or postdoc associated with this project is a genuine enthusiasm for the GOFAI approach briefly outlined here and explained in more detail in the much longer document. If you read that document and think that that is the kind of work you would love to do and would be capable of doing, then that is a very good sign. Throughout the document I give indications of things that I don’t yet know how to do. If you find yourself immediately drawn into thinking about those problems, which range from small technical problems to big theoretical questions such as the ones mentioned above, then that is even better. And if you are not fully satisfied with a proof unless you can see why it was natural for somebody to think of it, then that is better still.

    I would expect a significant proportion of people reading the document to have an instinctive reaction that the way I propose to attack the problems is not the best way, and that surely one should use some other technique — machine learning, large search, the Knuth-Bendix algorithm, a computer algebra package, etc. etc. — instead. If that is your reaction, then the project probably isn’t a good fit for you, as the GOFAI approach is what it is all about.

    As far as qualifications are concerned, I think the ideal candidate is somebody with plenty of experience of solving mathematical problems (either challenging undergraduate-level problems for a PhD candidate or research-level problems for a postdoc candidate), and good programming ability. But if I had to choose one of the two, I would pick the former over the latter, provided that I could be convinced that a candidate had a deep understanding of what a well-designed algorithm would look like. (I myself am not a fluent programmer — I have some experience of Haskell and Python and I think a pretty good feel for how to specify an algorithm in a way that makes it implementable by somebody who is a quick coder, and in my collaborations so far have relied on my collaborators to do the coding.) Part of the reason for that is that I hope that if one of the outputs of the group is detailed algorithm designs, then there will be remote participants who would enjoy turning those designs into code.

    How will the work be organized?

    The core group is meant to be a genuine team rather than simply a group of a few individuals with a common interest in automatic theorem proving. To this end, I plan that the members of the group will meet regularly — I imagine something like twice a week for at least two hours and possibly more — and will keep in close contact, and very likely meet less formally outside those meetings. The purpose of the meetings will be to keep the project appropriately focused. That is not to say that all team members will work on the same narrow aspect of the problem at the same time. However, I think that with a project like this it will be beneficial (i) to share ideas frequently, (ii) to keep thinking strategically about how to get the maximum expected benefit for the effort put in , and (iii) to keep updating our public list of open questions (which will not be open questions in the usual mathematical sense, but questions more like “How should a computer do this?” or “Why is it so obvious to a human mathematician that this would be a silly thing to try?”).

    In order to make it easy for people to participate remotely, I think probably we will want to set up a dedicated website where people can post thoughts, links to code, questions, and so on. Some thought will clearly need to go into how best to design such a site, and help may be needed to build it, which if necessary I could pay for. Another possibility would of course be to have Zoom meetings, but whether or not I would want to do that depends somewhat on who ends up participating and how they do so.

    Since the early days of Polymath I have become much more conscious that merely stating that a project is open to anybody who wishes to join in does not automatically make it so. For example, whereas I myself am comfortable with publicly suggesting a mathematical idea that turns out on further reflection to be fruitless or even wrong, many people are, for very good reasons, not willing to do so, and those people belong disproportionately to groups that have been historically marginalized from mathematics — which of course is not a coincidence. Because of this, I have not yet decided on the details of how remote participation might work. Maybe part of it could be fully open in the way that Polymath was, but part of it could be more private and carefully moderated. Or perhaps separate groups could be set up that communicated regularly with the Cambridge group. There are many possibilities, but which ones would work best depends on who is interested. If you are interested in the project but would feel excluded by certain modes of participation, then please get in touch with me and we can think about what would work for you.

    http://gowers.wordpress.com/?p=6531
    Extensions
    Leicester mathematics under threat again
    Uncategorized
    Four years ago I wrote a post about an awful plan by Leicester University to sack its entire mathematics department, invite them to reapply for their jobs, and rehire all but six “lowest performers”. Fortunately, after an outcry, the university backed down. Alas, now there’s a new vice-chancellor who appears to have learned nothing from […]
    Show full content

    Four years ago I wrote a post about an awful plan by Leicester University to sack its entire mathematics department, invite them to reapply for their jobs, and rehire all but six “lowest performers”. Fortunately, after an outcry, the university backed down.

    Alas, now there’s a new vice-chancellor who appears to have learned nothing from the previous debacle. This time, the plan, known by the nice fluffy name Shaping for Excellence, is to get rid of research in certain subjects of which pure mathematics is one (and medieval literature another). This would mean making all eight pure mathematicians at Leicester redundant. The story is spreading rapidly on social media (it’s attracted quite a bit of attention on Twitter, Reddit and Hacker News, for example), so I won’t write a long post. But just in case you haven’t heard about it, here’s a link to a petition you can sign if, like a lot of other people, you feel strongly that this is a bad decision. (At the time of writing, it has been signed by about 2,500 people, many of them very well known academics in the areas that Leicester University claims to be intending to promote, in well under 24 hours.)

    http://gowers.wordpress.com/?p=6526
    Extensions
    Mathematical Research Reports: a “new” mathematics journal is launched
    News
    From time to time academic journals undergo an interesting process of fission. Typically as a result of some serious dissatisfaction, the editorial board resigns en masse to set up a new journal, the publishers of the original journal build a new editorial board from scratch, and the result is two journals, one inheriting the editors […]
    Show full content

    From time to time academic journals undergo an interesting process of fission. Typically as a result of some serious dissatisfaction, the editorial board resigns en masse to set up a new journal, the publishers of the original journal build a new editorial board from scratch, and the result is two journals, one inheriting the editors and collective memory of the original journal, and the other keeping the name and the publisher. Which is the “true” successor? In practice it tends to be the one with the editors, with its sibling surviving as a zombie journal that is the successor in name only. Perhaps there are examples that go the other way, and there may be examples where both journals go on to thrive, but I have not looked closely at the examples I know about.

    I’m mentioning this because recently I have been involved in a rather unusual example of this phenomenon. Most cases I know of are the result either of frustration with the practices of the big commercial publishers or of malpractice by an editor-in-chief. But this was an open access journal with no publication charges, and with an extremely efficient and impeccably behaved editor-in-chief. So what was the problem?

    The journal started out in 1995 as Electronic Research Announcements of the AMS, or ERA-AMS for short. It was still called that when I first joined the editorial board. Its editor-in-chief was Svetlana Katok, who did a great job, and there was a high-powered editorial board. As its name suggests, it specialized in shortish papers announcing results that would then appear with more details in significantly longer papers, so it was a little like Comptes Rendus in its aim. It would also accept short articles of a more traditional kind.

    It never published all that many papers, and in 2007, I think for that reason (but don’t remember for sure), the AMS decided to discontinue it. But Svetlana Katok had put a lot into the journal and managed to find another publisher, the American Institute of Mathematical Sciences, and the editorial board agreed to continue serving. The name of the journal was changed to Electronic Research Announcements in the Mathematical Sciences, and its abbreviation was slightly abbreviated from ERA-AMS to ERA-MS.

    In 2016, after 22 years, Svetlana Katok decided to step down, and Boris Hasselblatt took over. It was a good moment to try to revitalize the journal, so measures were taken such as designing a new and better website and making more effort to publicize the journal, in the hope of attracting more submissions (or more precisely, more submissions of a high enough quality that we would want to publish them).

    However, despite these measures, the numbers remained fairly low — around ten a year (with quite a bit of variation), and this, indirectly, caused the problem that led to the split. The editors would have liked to see more papers published, but were not worried about it to the point where we would have been prepared to sacrifice quality to achieve it: we were ready to accept that this was, at least for now, a small journal. But AIMS was not so happy. In an effort to remedy (as they saw it) the situation, they appointed a co-editor-in-chief, who in turn appointed a number of new editors, with a more applied focus, with the idea that by broadening the scope of the journal they would increase the number of papers published.

    That did not precipitate the resignations, but at that stage most of us did not know that the new editors had been appointed without any consultation even with Boris Hasselblatt. But then AIMS took things a step further. Until that point, the journal had adopted a practice that I strongly approve of, which was for the editor who handled a paper to make a recommendation to the rest of the editorial board, with other editors encouraged to comment on that recommendation. This practice helps to guard against “rogue” editors and against abuse of the system in general. It also helps to maintain consistent standards, and provides a gentle pressure on editors to do their job conscientiously — there’s nothing like knowing that you’re going to have to justify your decision to a bunch of mathematicians.

    But suddenly the publishers told us that this system had to change, and that from now on the editorial board would not have the opportunity to vet papers, and would continue to have no say in new editorial appointments. (Various justifications were given for this, including that it would make it harder to recruit editors if they thought they had to make judgments about papers not in their immediate area.) At that point, it was clear that the soul of the journal was about to be destroyed, so over a few days the entire board (from before the start of the changes) resigned, resolving to start afresh with a new name.

    That new name is Mathematical Research Reports. We will continue to accept reports on longer work, as well as short articles. In addition we welcome short survey articles. We regard it as the continuation in spirit of ERA-MS. Another unusual feature of this particular split is that the other half, still published by AIMS, has also changed its name and is now called Electronic Research Archive.

    If, like me, you are always on the lookout for high-quality “ethical” journals (which I loosely define as free to read, free to publish in, and adopting high standards of editorial practice), then please add Mathematical Research Reports to your list. Have a look at the back catalogue of ERA-MS and ERA-AMS and you will get an idea of our standards. It would be wonderful if the unfortunate events of the last year or so were to be the catalyst that led to the journal finally becoming established in the way that it has deserved to be for a long time.

    http://gowers.wordpress.com/?p=6513
    Extensions
    How long should a lockdown-relaxation cycle last?
    News
    The adaptive-triggering policy. On page 12 of a document put out by Imperial College London, which has been very widely read and commented on, and which has had a significant influence on UK policy concerning the coronavirus, there is a diagram that shows the possible impact of a strategy of alternating between measures that are […]
    Show full content
    The adaptive-triggering policy.

    On page 12 of a document put out by Imperial College London, which has been very widely read and commented on, and which has had a significant influence on UK policy concerning the coronavirus, there is a diagram that shows the possible impact of a strategy of alternating between measures that are serious enough to cause the number of cases to decline, and more relaxed measures that allow it to grow again. They call this adaptive triggering: when the number of cases needing intensive care reaches a certain level per week, the stronger measures are triggered, and when it declines to some other level (the numbers they give are 100 and 50, respectively), they are lifted.

    If such a policy were ever to be enacted, a very important question would be how to optimize the choice of the two triggers. I’ve tried to work this out, subject to certain simplifying assumptions (and it’s important to stress right at the outset that these assumptions are questionable, and therefore that any conclusion I come to should be treated with great caution). This post is to show the calculation I did. It leads to slightly counterintuitive results, so part of my reason for posting it publicly is as a sanity check: I know that if I post it here, then any flaws in my reasoning will be quickly picked up. And the contrapositive of that statement is that if the reasoning survives the harsh scrutiny of a typical reader of this blog, then I can feel fairly confident about it. Of course, it may also be that I have failed to model some aspect of the situation that would make a material difference to the conclusions I draw. I would be very interested in criticisms of that kind too. (Indeed, I make some myself in the post.)

    Before I get on to what the model is, I would like to make clear that I am not advocating this adaptive-triggering policy. Personally, what I would like to see is something more like what Tomas Pueyo calls The Hammer and the Dance: roughly speaking, you get the cases down to a trickle, and then you stop that trickle turning back into a flood by stamping down hard on local outbreaks using a lot of testing, contact tracing, isolation of potential infected people, etc. (This would need to be combined with other measures such as quarantine for people arriving from more affected countries etc.) But it still seems worth thinking about the adaptive-triggering policy, in case the hammer-and-dance policy doesn’t work (which could be for the simple reason that a government decides not to implement it).

    A very basic model.

    Here was my first attempt at modelling the situation. I make the following assumptions. The numbers S, T, \lambda, \mu are positive constants.

    1. Relaxation is triggered when the rate of infection is S.
    2. Lockdown (or similar) is triggered when the rate of infection is T.
    3. The rate of infection is of the form Ae^{\lambda t} during a relaxation phase.
    4. The rate of infection is of the form Be^{-\mu t} during a lockdown phase.
    5. The rate of “damage” due to infection is \theta times the infection rate.
    6. The rate of damage due to lockdown measures is \eta while those measures are in force.

    For the moment I am not concerned with how realistic these assumptions are, but just with what their consequences are. What I would like to do is minimize the average damage by choosing S and T appropriately.

    I may as well give away one of the punchlines straight away, since no calculation is needed to explain it. The time it takes for the infection rate to increase from S to T or to decrease from T to S depends only on the ratio T/S. Therefore, if we divide both T and S by 2, we decrease the damage due to the infection and have no effect on the damage due to the lockdown measures. Thus, for any fixed ratio T/S, it is best to make both T and S as small as possible.

    This has the counterintuitive consequence that during one of the cycles one would be imposing lockdown measures that were doing far more damage than the damage done by the virus itself. However, I think something like that may actually be correct: unless the triggers are so low that the assumptions of the model completely break down (for example because local containment is, at least for a while, a realistic policy, so national lockdown is pointlessly damaging), there is nothing to be lost, and lives to be gained, by keeping them in the same proportion but decreasing them.

    Now let me do the calculation, so that we can think about how to optimize the ratio T/S for a fixed S.

    The time taken for the infection rate to increase from S to T is s=\lambda^{-1}\log(T/S), and during that time the number of infections is
    \int_0^sSe^{\lambda t}dt=\lambda^{-1}S(e^{\lambda s}-1)=\lambda^{-1}(T-S).

    By symmetry the number of infections during the lockdown phase is \mu^{-1}(T-S) (just run time backwards). So during a time (\lambda^{-1}+\mu^{-1})\log(T/S) the damage done by infections is \theta(\lambda^{-1}+\mu^{-1})(T-S), making the average damage \theta(T-S)/(\log T-\log S). Meanwhile, the average damage done by lockdown measures over the whole cycle is \mu^{-1}\eta/(\lambda^{-1}+\mu^{-1})=\eta\lambda/(\lambda+\mu).

    Note that the lockdown damage doesn’t depend on S and T: it just depends on the proportion of time spent in lockdown, which depends only on the ratio between \lambda and \mu. So from the point of view of optimizing S and T, we can simply forget about the damage caused by the lockdown measures.

    Returning, therefore, to the term \theta(T-S)/(\log T-\log S), let us say that T=(1+\alpha)S. Then the term simplifies to \theta\alpha S/\log(1+\alpha). This increases with \alpha, which leads to a second counterintuitive conclusion, which is that for fixed S, \alpha should be as close as possible to 0. So if, for example, \lambda=2\mu, which tells us that the lockdown phases have to be twice as long as the relaxation phases, then it would be better to have cycles of two days of lockdown and one of relaxation than cycles of six weeks of lockdown and three weeks of relaxation.

    Can this be correct? It seems as though with very short cycles the lockdowns wouldn’t work, because for one day in three people would be out there infecting others. I haven’t yet got my head round this, but I think what has gone wrong is that the model of exponential growth followed instantly by exponential decay is too great a simplification of what actually happens. Indeed, data seem to suggest a curve that rounds off at the top rather than switching suddenly from one exponential to another — see for example Chart 9 from the Tomas Pueyo article linked to above. But I think it is correct to conclude that the length of a cycle should be at most of a similar order of magnitude to the “turnaround time” from exponential growth to exponential decay. That is, one should make the cycles as short as possible provided that they are on a timescale that is long enough for the assumption of exponential growth followed by exponential decay to be reasonably accurate.

    What if we allow a cycle with more than two kinds of phases?

    So far I have treated \lambda and \mu and \eta as parameters that we have no control over at all. But in practice that is not the case. At any one time there is a suite of measures one can take — encouraging frequent handwashing, banning large gatherings, closing schools, encouraging working from home wherever possible, closing pubs, restaurants, theatres and cinemas, enforcing full lockdown — that have different effects on the rate of growth or decline in infection and cause different levels of damage.

    It seems worth taking this into account too, especially as there has been a common pattern of introducing more and more measures as the number of cases goes up. That feels like a sensible response — intuitively one would think that the cure should be kept proportionate — but is it?

    Let’s suppose we have a collection of possible sets of measures M_1,\dots,M_r. For ease of writing I shall call them measures rather than sets of measures, but in practice each M_i is not just a single measure but a combination of measures such as the ones listed above. Associated with each measure M_i is a growth rate \lambda_i (which is positive if the measures are not strong enough to stop the disease growing and negative if they are strong enough to cause it to decay) and a damage rate \eta_i. Suppose we apply M_i for time t_i. Then during that time the rate of infection will multiply by e^{\lambda_it_i}. So if we do this for each measure, then we will get back to the starting infection rate provided that \lambda_1t_1+\dots+\lambda_rt_r=0. (This is possible because some of the \lambda_i are negative and some are positive.)

    There isn’t a particularly nice expression for the damage resulting from the disease during one of these cycles, but that does not mean that there is nothing to say. Suppose that the starting rate of infection is S_0 and that the rate after the first i stages of the cycle is S_i. Then S_i=e^{\lambda_it_i}S_{i-1}. Also, by the calculation above, the damage done during the ith stage is \theta\lambda_i^{-1}(S_i-S_{i-1}).

    In what order should the M_i be applied?

    This has an immediate consequence for the order in which the M_i should be applied. Let me consider just the first two stages. The total damage caused by the disease during these two stages is

    \theta(\lambda_1^{-1}(S_1-S_0)+\lambda_2^{-1}(S_2-S_1))

    =\theta S_0(\lambda_1^{-1}(e^{\lambda_1t_1}-1)+\lambda_2^{-1}e^{\lambda_1t_1}(e^{\lambda_2t_2}-1)).

    To make that easier to read, let’s forget the \theta S_0 term (which we’re holding constant) and concentrate on the expression

    \lambda_1^{-1}(e^{\lambda_1t_1}-1)+\lambda_2^{-1}e^{\lambda_1t_1}(e^{\lambda_2t_2}-1).

    If we reorder stages 1 and 2, we can replace this damage by

    \lambda_2^{-1}(e^{\lambda_2t_2}-1)+\lambda_1^{-1}e^{\lambda_2t_1}(e^{\lambda_1t_2}-1).

    This is an improvement if the second number is smaller than the first. But the first minus the second is equal to

    (\lambda_2^{-1}-\lambda_1^{-1})(e^{\lambda_1t_1}-1)(e^{\lambda_2t_2}-1),

    so the reordering is a good idea if \lambda_1>\lambda_2. This tells us that we should start with smaller \lambda_i and work up to bigger ones. Of course, since we are applying the measures in a cycle, we cannot ensure that the \lambda_i form an increasing sequence, but we can say, for example, that if we first apply the measures that allow the disease to spread, and then the ones that get it to decay, then during the relaxation phase we should work from the least relaxed measures to the most relaxed ones (so the growth rate will keep increasing), and during the suppression phase we should start with the strictest measures and work down to the most relaxed ones.

    It might seem strange that during the relaxation phase the measures should get gradually more relaxed as the spread worsens. In fact, I think it is strange, but I think what that strangeness is telling us is that using several different measures during the relaxation phase is not a sensible thing to do.

    Which sets of measures should be chosen?

    The optimization problem I get if I try to balance the damage from the disease with the damage caused by the various control measures is fairly horrible, so I am going to simplify it a lot in the following way. The basic principle that there is nothing to be lost by dividing everything by 2 still applies when there are lots of measures, so I shall assume that a sensible government has taken that point on board to the point where the direct damage from the disease is insignificant compared with the damage caused by the measures. (Just to be clear, I certainly don’t mean that lives lost are insignificant, but I mean that the number of lives lost to the disease is significantly smaller than the number lost as an indirect result of the measures taken to control its spread.) Given this assumption, I am free to concentrate just on the damage due to the measures M_i, so this is what I will try to minimize.

    The total damage across a full cycle is \sum_i\eta_it_i, so the average damage, which is what matters here, is

    \frac{\eta_1t_1+\eta_2t_2+\dots+\eta_rt_r}{t_1+t_2+\dots+t_r}.

    We don’t have complete freedom to choose, or else we’d obviously just choose the smallest \eta_i and go with that. The constraint is that the growth rate of the virus has to end up where it began: this is the constraint that \lambda_1t_1+\dots+\lambda_rt_r=0, which we saw earlier.

    Suppose we can find s_1,\dots,s_r such that \sum_is_i=\sum_i\lambda_is_i=0, but \sum_i\eta_is_i\ne 0. Then in particular we can find such s_1,\dots,s_r with \sum_i\eta_is_i<0. If all the t_i are strictly positive, then we can also choose them in such a way that all the t_i+s_i are still strictly positive. So if we replace each t_i by t_i-s_i, then the numerator of the fraction decreases, the denominator stays the same, and the constraint is still satisfied. It follows that we had not optimized.

    Therefore, if the choice of t_1,\dots,t_r is optimal and all the t_i are non-zero (and therefore strictly positive — we can’t run some measures for a negative amount of time) it is not possible to find s_1,\dots,s_r such that \sum_is_i=\sum_i\lambda_is_i=0, but \sum_i\eta_is_i\ne 0. This is equivalent to the statement that the vector (\eta_1,\dots,\eta_r) is a linear combination of the vectors (1,1,\dots,1) and (\lambda_1,\dots,\lambda_r). In other words, we can find \sigma,\rho such that \eta_i=\sigma-\rho\lambda_i for each i. I wrote it like that because the smaller \lambda_i is, the larger the damage one expects the measures to cause. Thus, the points form a descending sequence. (We can assume this, since if one measure causes both more damage and a higher growth rate than another, then there can be no reason to choose it.) Thus, \rho will be positive, and since at least some \lambda_i are positive, and no measures will cause a negative amount of damage, \sigma is positive as well.

    The converse of this statement is true as well. If \eta_i=\sigma-\rho\lambda_i for every i, then \sum_i\eta_it_i=\sigma\sum_it_i-\rho\sum_i\lambda_it_i=\sigma\sum_it_i, from which it follows that the average damage across the cycle is \sigma, regardless of which measures are taken for which lengths of time.

    This already shows that there is nothing to be gained from having more than one measure for the relaxation phase and one for the lockdown phase. There remains the question of how to choose the best pair of measures.

    To answer it, we can plot the points (\lambda_i,\eta_i). The relaxation points will appear to the right of the y-axis and the suppression points will appear to the left. If we choose one point from each side, then they lie in some line y=\sigma-\rho x, of which \sigma is the intercept. Since \sigma is the average damage, which we are trying to minimize, we see that our aim is to find a line segment joining a point on the left-hand side to a point on the right-hand side, and we want it to cross the y-axis as low as possible.

    It is not hard to check that the intercept of the line joining (-\mu,\xi) to (\lambda,\eta) is at \frac{\mu\xi+\lambda\eta}{\mu+\lambda}. So if we rename the points to the left of the y-axis (-\mu_i,\xi_i) and the points to the right (\lambda_j,\eta_j), then we want to minimize \frac{\mu_i\xi_i+\lambda_j\eta_j}{\mu_i+\lambda_j} over all i,j.

    Can we describe the best choice in a less formal way?

    It isn’t completely easy to convert this criterion into a rule of thumb for how best to choose two measures, one for the relaxation phase and one for the suppression phase, but we can draw a couple of conclusions from it.

    For example, suppose that for the suppression measures there is a choice between two measures, one of which works twice as quickly as the other but causes twice as much damage per unit time. Then the corresponding two points lie on a line with negative gradient that goes through the origin, and therefore lies below all points in the positive quadrant. From this it follows that the slower but less damaging measure is better. Another way of seeing that is that with the more severe measure the total damage during the lockdown phase stays the same, as does the total damage during the relaxation phase, but the length of the cycle is decreased, so the average damage is increased.

    Note that I am not saying that one should always go for less severe measures — I made the strong assumption there that the two points lay in a line through the origin. If we can choose a measure that causes damage at double the rate but acts three times as quickly as another measure, then it may turn out to be better than the less damaging but slower measure.

    However, it seems plausible that the set of points will exhibit a certain amount of convexity. That is because if you want to reduce the growth rate of infections, then at first there will be some low-hanging fruit — for example, it is not costly at all to run a public-information campaign to persuade people to wash their hands more frequently, and that can make quite a big difference — but the more you continue, the more difficult making a significant difference becomes, and you have to wheel out much more damaging measures such as school closures.

    If the points were to lie on a convex curve (and I’m definitely not claiming this, but just saying that something like it could perhaps be true), then the best pair of points would be the ones that are nearest to the y-axis on either side. This would say that the best strategy is to alternate between a set of measures that allows the disease to grow rather slowly and a set of measures that causes it to decay slowly again.

    This last conclusion points up another defect in the model, which is the assumption that a given set of measures causes damage at a constant rate. For some measures, this is not very realistic: for example, even in normal times schools alternate between periods of being closed and periods of being open (though not necessarily to a coronavirus-dictated timetable of course), so one might expect the damage from schools being 100% closed to be more than twice the damage from schools being closed half the time. More generally, it might well be better to rotate between two or three measures that all cause roughly the same rate of damage, but in different ways, so as to spread out the damage and try to avoid reaching the point where the rate of one kind of damage goes up.

    Summary of conclusions.

    Again I want to stress that these conclusions are all quite tentative, and should certainly not be taken as a guide to policy without more thought and more sophisticated modelling. However, they do at least suggest that certain policies ought not to be ruled out without a good reason.

    If adaptive triggering is going to be applied, then the following are the policies that the above analysis suggests. First, here is a quick reminder that I use the word “measure” as shorthand for “set of measures”. So for example “Encourage social distancing and close all schools, pubs, restaurants, theatres, and cinemas” would be a possible measure.

    1. There is nothing to lose and plenty to gain by making the triggers (that is, the infection rates that cause one to switch from relaxation to suppression and back again) low. This has the consequence that the triggers should be set in such a way that the damage from the measures is significantly higher than the damage caused by the disease. This sounds paradoxical, but the alternative is to make the disease worse without making the cure any less bad, and there is no point in doing that.
    2. Within reason, the cycles should be kept short.
    3. There is no point in having more than one measure for the relaxation phase and one for the suppression phase.
    4. If you must have more than one measure for each phase, then during the relaxation phase the measures should get more relaxed each time they change, and during the suppression phase they should get less strict each time they change.
    5. Given enough information about their consequences, the optimal measures can be determined quite easily, but doing the calculation in practice, especially in the presence of significant uncertainties, could be quite delicate.

    Point number 1 above seems to me to be quite a strong argument in favour of the hammer-and-dance approach. That is because the conclusion, which looks to me quite robust to changes in the model, is that the triggers should be set very low. But if they are set very low, then it is highly unlikely that the enormous damage caused by school closures, lockdowns etc. is the best approach for dealing with the cases that arise, since widespread testing and quarantining of people who test positive, contacts of those people, people who arrive from certain other countries, and so on, will probably be far less damaging, even if they are costly to do well. So I regard point number 1 as a sort of reductio ad absurdum of the adaptive-triggering approach.

    Point number 2 seems quite robust as well, but I think the model breaks down on small timescales (for reasons I haven’t properly understood), so one shouldn’t conclude from it that the cycles should be short on a timescale of days. That is what is meant by “within reason”. But they should be as short as possible provided that they are long enough for the dominant behaviour of the infection rate to be exponential growth and decay. (That does not imply that they should not be shorter than this — just that one cannot reach that conclusion without a more sophisticated model. But it seems highly likely that there is a minimum “reasonable” length for a cycle: this is something I’d be very interested to understand better.)

    Point number 3 was a clear consequence of the simple model (though it depended on taking 1 seriously enough that the damage from the disease could be ignored), but may well not be a sensible conclusion in reality, since the assumption that each measure causes damage at a rate that does not change over time is highly questionable, and dropping that assumption could make quite a big difference. Nevertheless, it is interesting to see what the consequences of that assumption are.

    Point number 4 seems to be another fairly robust conclusion. However, in the light of 3 one might hope that it would not need to be applied, except perhaps as part of a policy of “rotating” between various measures to spread the damage about more evenly.

    It seems at least possible that the optimal adaptive-triggering policy, if one had a number of choices of measures, would be to choose one set that causes the infections to grow slowly and another that causes them to shrink slowly — in other words to fine tune the measures so as to keep the infection rate roughly constant (and small). Such fine tuning would be very dangerous to attempt now, given how much uncertainty we are facing, but could become more realistic after a few cycles, when we would start to have more information about the effects of various measures.

    One final point is that throughout this discussion I have been assuming that the triggers would be based on the current rate of infection. In practice of course, this is hard to measure, which is presumably why the Imperial College paper used demand for intensive care beds instead. However, with enough data about the effects of various measures on the rate of spread of the virus, one would be less reliant on direct measurements, and could instead make inferences about the likely rate of infection given data collected over the previous few weeks. This seems better than using demand for ICU beds as a trigger, since that demand reflects the infection rate from some time earlier.

    http://gowers.wordpress.com/?p=6468
    Extensions
    Advances in Combinatorics fully launched
    News
    It’s taken longer than we originally intended, but I am very happy to report that Advances in Combinatorics, a new arXiv overlay journal that is run along similar lines to Discrete Analysis, has just published its first five articles, with plenty more in the pipeline. I am excited by the business model of the journal, […]
    Show full content

    It’s taken longer than we originally intended, but I am very happy to report that Advances in Combinatorics, a new arXiv overlay journal that is run along similar lines to Discrete Analysis, has just published its first five articles, with plenty more in the pipeline.

    I am excited by the business model of the journal, which is that its very small running costs (like Discrete Analysis, it uses the Scholastica platform, which charges $10 per submission, as well as a fixed annual charge of $250, and there are a few other costs such as archiving articles with CLOCKSS and having DOIs) are being met, for the next five years in the first instance, by the library of Queens University in Kingston Ontario, who are also providing us with very useful administrative support. My dream would be for other libraries to have the foresight to support similar ventures, since the potential for savings if the stranglehold of the big commercial publishers is broken is huge. I do not underestimate the obstacles, but for a long time I have felt that what we are waiting for is a tipping point, and even quite a small venture can in principle have a significant effect on when that tipping point occurs.

    The aim of Advances in Combinatorics is to be a top specialist combinatorics journal. Information about the editorial board, the submission process, and of course the articles themselves, can all be found on the website. Like Discrete Analysis, the journal has Editorial Introductions to each article, with the aim of making the webpage informative and fun to browse. We will be grateful for any feedback, and even more grateful for support in the form of excellent combinatorics papers while we are still getting established.

    A final remark is that although I have reached the limit of the number of journals of this type that I can be closely involved in, I would be delighted to hear from anybody who thinks that they can put together a high-quality editorial board in an area that does not have a suitable journal for people who want to publish good papers without supporting the big commercial publishers. I know people who can offer advice about suitable platforms, funding, and similar matters. It would be great if free-to-read free-to-publish journals could cover all of mathematics, but we are still a long way from that at the moment.

    http://gowers.wordpress.com/?p=6456
    Extensions
    The fate of combinatorics at Strathclyde
    News
    I have just received an email from Sergey Kitaev, one of the three combinatorialists at Strathclyde. As in many universities, they belong not to the mathematics department but to the computer science department. Kitaev informs me that the administrators of that department, in their infinite wisdom, have decided that the future of the department is […]
    Show full content

    I have just received an email from Sergey Kitaev, one of the three combinatorialists at Strathclyde. As in many universities, they belong not to the mathematics department but to the computer science department. Kitaev informs me that the administrators of that department, in their infinite wisdom, have decided that the future of the department is best served by axing discrete mathematics. I won’t write a long post about this, but instead refer you to a post by Peter Cameron that says everything I would want to say about the decision, and does so extremely cogently. I recommend that you read it if this kind of decision worries you.

    http://gowers.wordpress.com/?p=6454
    Extensions
    Voting tactically in the EU elections
    News
    This post is addressed at anyone who is voting in Great Britain in the forthcoming elections to the European Parliament and whose principal aim is to maximize the number of MEPs from Remain-supporting parties, where those are deemed to be the Liberal Democrats, the Greens, Change UK, Plaid Cymru and the Scottish National Party. If […]
    Show full content

    This post is addressed at anyone who is voting in Great Britain in the forthcoming elections to the European Parliament and whose principal aim is to maximize the number of MEPs from Remain-supporting parties, where those are deemed to be the Liberal Democrats, the Greens, Change UK, Plaid Cymru and the Scottish National Party. If you have other priorities, then the general principles laid out here may be helpful, but the examples of how to apply them will not necessarily be appropriate to your particular concerns.

    What is the voting system?

    The system used is called the d’Hondt system. The country is divided into a number of regions, and from each region several MEPs will be elected. You get one vote, and it is for a party rather than a single candidate. Once the votes are in, there are a couple of ways of thinking about how they translate into results. One that I like is to imagine that the parties have the option of assigning their votes to their candidates as they wish, and once the assignments have been made, the n candidates with the most votes get seats, where n is the number of MEPs representing the given region.

    For example, if there are three parties for four places, and their vote shares are 50%, 30% and 20%, then the first party will give 25% to two candidates and both will be elected. If the second party tries a similar trick, it will only get one candidate through because the 20% that goes to the third party is greater than the 15% going to the two candidates from the second party. So the result is two candidates for the first party, one for the second and one for the third.

    If the vote shares had been 60%, 25% and 15%, then the first party could afford to split three ways and the result would be three seats for the first party and one for the second.

    The way this is sometimes presented is as follows. Let’s go back to the first case. We take the three percentages, and for each one we write down the results of dividing it by 1, 2, 3, etc. That gives us the (approximate) numbers

    50%, 25%, 17%, 13%, 10%, …

    30%, 15%, 10%, 8%, 6%, …

    20%, 10%, 7%, 5%, 3%, …

    Looking at those numbers, we see that the biggest four are 50%, 25% from the top row, 30% from the second row, and 20% from the third row. So the first party gets two seats, the second party one and the third party one.

    How does this affect how I should vote?

    The answer to this question depends in peculiar ways on the polling in your region. Let’s take my region, Eastern England, as an example. This region gets seven MEPs, and the latest polls show these kinds of percentages.

    Brexit Party 40%
    Liberal Democrats 17%
    Labour 15%
    Greens 10%
    Conservatives 9%
    Change UK 4%
    UKIP 2%

    If the percentages stay as they are, then the threshold for an MEP is 10%. The Brexit Party gets four MEPs, and the Lib Dems, Labour and the Greens one each. But because the Brexit Party, the Greens and the Conservatives are all close to the 10% threshold, small swings can make a difference to which two out of the fourth Brexit Party candidate, the Green candidate, or the Conservative candidate gets left out. On the other hand, it would take a much bigger swing — of 3% or so — to give the second Lib Dem candidate a chance of being elected. So if your main concern is to maximize the number of Remain-supporting MEPs, you should support the Greens.

    Yes, but what if everybody were to do that?

    In principle that is an annoying problem with the d’Hondt system. But don’t worry — it just isn’t going to happen. Systematic tactical voting is at best a marginal phenomenon, but fortunately in this region a marginal phenomenon may be all it takes to make sure that the Green candidate gets elected.

    Aren’t you being defeatist? What about trying to get two Lib Dems and one Green through?

    This might conceivably be possible, but it would be difficult, and a risky strategy, since going for that could lead to just one Remain-supporting MEP. One possibility would be for Remain-leaning Labour voters to say to themselves “Well, we’re basically guaranteed an MEP, and I’d much prefer a Remain MEP to either the Conservatives or the Brexit Party, so I’ll vote Green or Lib Dem instead.” If that started showing up in polls, then one would be able to do a better risk assessment. But for now it looks better to make sure that the Green candidate gets through.

    I’m not from the Eastern region. Where can I find out how to vote in my region?

    There is a website called remainvoter.com that has done the analysis. The reason I am writing this post is that I have seen online that a lot of people are highly sceptical about their conclusions, so I wanted to explain the theory behind them (as far as I can guess it) so that you don’t have to take what they say on trust and can do the calculations for yourself.

    Just to check, I’ll look at another region and see whether I end up with a recommendation that agrees with that of remainvoter.com.

    In the South West, there are six MEPs. A recent poll shows the following percentages.

    Brexit Party 42%
    Lib Dem 20%
    Green Party 12%
    Conservatives 9%
    Labour 8%
    Change UK 4%
    UKIP 3%

    Dividing the Brexit Party vote by 3 gives 14% and dividing the Lib Dem vote by 2 gives 10%. So as things stand there would be three Brexit Party MEPs, two Lib Dem MEPs and one Green Party MEP.

    This is a bit close for comfort, but the second Lib Dem candidate is in a more precarious position than the Green Party candidate, given that the Conservative candidate is on 9%. So it would make sense for a bit of Green Party support to transfer to the Lib Dems in order to be sure that the three Remain-supporting candidates that look like being elected in the south west really are.

    Interestingly, remainvoter.com recommend supporting the Greens on the grounds that one Lib Dem MEP is bound to be elected. I’m not sure I understand this, since it seems very unlikely that the Lib Dems and the Greens won’t get at least two seats between them, so they might as well aim for three. Perhaps someone can enlighten me on this point. It could be that remainvoter.com is looking at different polls from the ones I’m looking at.

    I’m slightly perturbed by that so I’ll pick another region and try the same exercise. Perhaps London would be a good one. Here we have the following percentages (plus a couple of smaller ones that won’t affect anything).

    Liberal Democrats 24% (12%, 8%)
    Brexit Party 21% (10.5%, 7%)
    Labour 19% (9.5%, 6.3%)
    Green Party 14% (7%)
    Conservatives 10%
    Change UK 6%

    London has eight MEPs. Here I find it convenient to use the algorithm of dividing by 1,2,3 etc., which explains the percentages I’ve added in brackets. Taking the eight largest numbers we see that the current threshold to get an MEP is at 9.5%, so the Lib Dems get two, the Brexit party two, Labour two and the Greens and Conservatives one each.

    Here it doesn’t look obvious how to vote tactically. Clearly not Green, since the Greens are squarely in the middle of the range between the threshold and twice the threshold. Probably not Lib Dem either (unless things change quite a bit) since they’re unlikely to go up as far as 28.5%. But getting Change UK up to 9.5% also looks pretty hard to me. Perhaps the least difficult of these difficult options is for the Green Party to donate about 3% of the vote and the Lib Dems another 2% to Change UK, which would allow them to overtake Labour. But I don’t see it happening.

    And now to check my answer, so to speak. And it does indeed agree with the remainvoter.com recommendation. This looks to me like a case where if tactical voting were to be widely adopted, then it might just work to get another MEP, but if it were that widely adopted, one might have to start worrying about not overshooting and accidentally losing one of the other Remain MEPs. But that’s not likely to happen, and in fact I’d predict that in London Change UK will not get an MEP because not enough people will follow remainvoter.com’s recommendation.

    This all seems horribly complicated. What should I do?

    If you don’t want to bother to think about it, then just go to remainvoter.com and follow their recommendation. If you do want to think, then follow these simple (for a typical reader of this blog anyway) instructions.

    1. Google polls for your region. (For example, you can scroll down to near the bottom of this page to find one set of polls.)

    2. Find out how many MEPs your region gets. Let that number be n.

    3. For each percentage, divide it by 1, 2, 3 etc. until you reach a number that clearly won’t be in the top n.

    4. See what percentage, out of all those numbers, is the nth largest.

    5. Vote for a Remain party that is associated with a number that is close to the threshold if there is also a Brexit-supporting (or Brexit-fence-sitting) party with a number close to the threshold.

    One can refine point 5 as follows, to cover the case when more than one Remain-supporting party has a number near the threshold. Suppose, for the sake of example, that the Brexit party is polling at 32%, the Lib Dems at 22%, the Greens at 11%, Labour at 18% and the Conservatives at 12%, and others 5%, in a region that gets five MEPs. Then carrying out step 3, we get

    Brexit 32, 16, 10.6
    Lib Dems 22, 11
    Greens 11
    Conservatives 12
    Labour 18, 9

    So as things stand the Brexit Party gets two MEPs, the Lib Dems one, Labour one and the Conservatives one. If you’re a Remain supporter who wants to vote tactically, then you’ll want to push one of the Lib Dems and the Greens over 12% to defeat the Conservative candidate. To do that, you’ll need either to increase the Green vote from 11% to 12% or to increase the Lib Dem vote from 22% to 24%. The latter is probably harder, so you should probably support the Greens.

    A final word

    I’m not writing this as an expert, so don’t assume that everything I’ve written is correct, especially given that I came to a different conclusion from remainvoter.com in the South West. If you think I’ve slipped up, then please let me know in the comments, and if I agree with you I’ll make changes. But bear in mind the premise with which I started. Of course there may well be reasons for not voting tactically, such as caring about issues other than Brexit. But this post is about what to do if Brexit is your overriding concern. And one obvious last point: PLEASE ACTUALLY BOTHER TO VOTE! Just the percentage of people voting for Remain-supporting parties will have an impact, even if Farage gets more MEPs.

    http://gowers.wordpress.com/?p=6447
    Extensions
    How Craig Barton wishes he’d taught maths
    Mathematical pedagogy
    A couple of months ago, I can’t remember precisely how, I became aware of a book called How I Wish I’d Taught Maths, by Craig Barton, that seemed to be highly thought of. The basic idea was that Craig Barton is an experienced, and by the sound of things very good, maths teacher who used […]
    Show full content

    A couple of months ago, I can’t remember precisely how, I became aware of a book called How I Wish I’d Taught Maths, by Craig Barton, that seemed to be highly thought of. The basic idea was that Craig Barton is an experienced, and by the sound of things very good, maths teacher who used to take a number of aspects of teaching for granted, until he looked into the mathematics-education literature and came to realize that many of his cherished beliefs were completely wrong. Since I’ve always been interested in the question of how best to teach mathematics, both because of my own university teaching and because from time to time I like to pontificate about school-level teaching, I decided to order the book. More surprisingly, given my past history of buying books that I felt I ought to read, I read it from cover to cover, all 450 pages of it.

    As it happens, the book is ideally designed for people who don’t necessarily want to read it from cover to cover, because it is arranged as follows. At the top level it is divided into chapters. Each chapter starts with a small introduction and thereafter is divided into sections. And each section has precisely the same organization: it is divided into subsections entitled, “What I used to believe”, “Sources of inspiration”, “My takeaway”, and “What I do now”. These are reasonably self-explanatory, but just to spell it out, the first subsection sets out a plausible belief that Craig Barton used to have about good teaching practice, often ending with a rhetorical question such as “What could possibly be wrong with that?”, the second is a list of references (none of which I have yet followed up, but some of them look very interesting), the third is a discussion of what he learned from the references, and the last one is about how he put that into practice. Also, each chapter ends with a short subsection entitled “If I only remember three things …”, where he gives three sentences that sum up what he thinks is most important in the chapter.

    One question I had in the back of my mind when reading the book was whether any of it applied to teaching at university level. I’m still not sure what I think about that. There is a reason to think not, because the focus of the book is very much on school-level teaching, and many of the challenges that arise do not have obvious analogues at university level. For example, he mentioned (on page 235) the following fascinating experiment, where people were asked to do the following multiple-choice question and then justify their answers.

    Which of these values could not represent a probability?

    A. 2/3

    B. 0.72315

    C. 1.46

    D. 0.002

    Let me quote the book itself for a discussion of this question.

    Surely the rule probabilities must be less than or equal to 1 is about as straightforward as it gets in maths? But why, then, did 47% of the 5000+ students who answered this question get it wrong?

    A few students’ explanations reveal all:

    I think B because it’s just a massive decimal and the rest look pretty legit. I also don’t see how a number that big could be correct.

    I think B because you wouldn’t see this in probability questions.

    I think D because you can’t have 0.002 as an answer because it is too low.

    If students are only used to meeting `nice-looking’ probabilities during examples and practice questions, then it is little surprise they come a cropper when they encounter strange-looking answers.

    Could one devise a university-level question that would catch a significant proportion of people out in a similar way? I’m not sure, but here’s an attempt.

    Which of the following is not a vector space with the obvious notions of addition and scalar multiplication?

    A. The set of all complex numbers.

    B. The set of all functions from (0,1) to \mathbb R that are twice differentiable.

    C. The set of all polynomials in x with real coefficients that have x^2+x+1 as a factor.

    D. The set of all triples (a,b,c) of integers.

    E. The set of all sequences (x_1,\dots,x_n)\in\mathbb R^n such that x_1+\dots+x_n=0 and x_1+2x_2+\dots+nx_n=0.

    I think at Cambridge almost everyone would get this question right (though I’d love to do the experiment). But Cambridge mathematics undergraduates have been selected specifically to study mathematics. Perhaps at a US university, before people have chosen their majors, people might be tempted to choose another option (such as B, because vector spaces are to do with algebra and not calculus), while not noting that the obvious scalars in D do not form a field. Or perhaps they wouldn’t like A because the scalar field is the same as the set of vectors (unless, that is, they thought that the obvious scalars were the real numbers).

    More generally, I feel that there are certain kinds of mistakes that are commonly made at school level that are much less common at university level simply because those who survive long enough to reach that stage have been trained not to make them. For example, at university level we become used to formal definitions. Once one is in the habit of using these, deciding whether a structure is a vector space is simply a question of seeing whether the definition of a vector space applies, rather than thinking “Hmm, does that look like the vector spaces I’ve met up to now?” We also become part of a culture where it is common to look at pathological, or at least slightly surprising, examples of concepts, and so on.

    Another reason I decided to read the book was that I have certain prejudices about the teaching of mathematics at school level and I was interested to know whether they would be reinforced by the book or challenged by it. This was a win-win situation, since it is always nice to have one’s prejudices confirmed, but also rather exhilarating to find out that something that seems obviously correct is in fact wrong.

    A prejudice that was strongly confirmed was the value of mathematical fluency. Barton says, and I agree with him (and suggested something like it in my book Mathematics, A Very Short Introduction) that it is often a good idea to teach fluency first and understanding later. More precisely, in order to decide whether it is a good idea, one should assess (i) how difficult it is to give an explanation of why some procedure works and (ii) how difficult it is to learn how to apply the procedure without understanding why it works.

    For instance, suppose you want to teach multiplication of negative numbers. The rule “If they have the same sign then the answer is positive, and if they have different signs then the answer is negative” is a short and straightforward rule, but explaining why -2 times -3 should equal 6 is not very straightforward. So if one begins with the explanation, there is a big risk of conveying the idea that multiplication of negative numbers is a difficult, complicated topic, whereas if one gives plenty of practice in applying the simple rules, then one gives one’s students fluency in an operation that comes up in many other contexts (such as, for instance, multiplying (x-2) by (x-3)), and one can try to justify the rule later, when they are comfortable with the rule itself. I remember enjoying the challenge of thinking about why the rule for dividing one fraction by another was correct, but that was long after I was happy with using the rule itself. I don’t remember being bothered by the lack of justification up to that point.

    As an example in the other direction, Barton gives that of solving linear equations. The danger here is that one can learn a procedure for solving equations such as 2x+3=17, get good at it, and then be completely stuck when faced with an equation such as 4-2x=3x-11. Here a bit of understanding can greatly help. Barton advocates something called the balance method, where one imagines both sides of the equation on a balance, and one is required to make sure that balance is maintained the whole time. I think (but without too much confidence after reading this book) that I would go for something roughly equivalent, but not quite the same, which is to stress the rule you can do the same thing to both sides of an equation (worrying about things like squaring both sides or multiplying by zero later). Then the problem of solving linear equations would be reduced to a kind of puzzle: what can we do to both sides of this equation to make the whole thing look simpler?

    That last question is related to another fascinating nugget that is mentioned in the book. Barton gives an example of a question concerning a parallelogram ABCD, where the angle at A is 105 degrees. The line BC is extended to a point E, which is then joined by an additional line segment to D, and the angle CED is 30 degrees. The question is to prove that the triangle CED is isosceles.

    Apparently, this question is found hard, because one cannot achieve the goal in one step. Instead, one must observe that the angle of the parallelogram at C is also 105 degrees, from which it follows that the angle ECD is 75 degrees. And from that it follows that the angle EDC is 75 degrees as well, and the problem is solved.

    But the interesting thing is that if you change the question to the more open-ended question, “Fill in as many angles in this diagram as you can,” then many people who found the goal-oriented version too hard have no difficulty in filling out all the angles in the diagram and therefore noticing that the triangle CED is isosceles.

    The lesson I would draw from this with the equations question is that instead of asking for a solution to the equation 4-2x=3x-11, it might be better to ask “See whether you can make the equation look simpler by doing something to both sides. If you manage, see if you can then make it even simpler. Keep going until you have made it as simple as you can.” This would of course come after they had already seen several examples of the kind of thing one can do to both sides of an equation.

    Barton isn’t content with just telling the reader that certain methods of teaching are better than others: he also tells us the theory behind them. Of particular importance, he claims, is the fact that we cannot hold very much in our short-term memory. This was music to my ears, as it has long been a belief of mine that the limited capacity of our short-term memory is a hugely important part of the answer to the question of why mathematics looks as it does, by which I mean why, out of all the well-formed mathematical statements one could produce, the ones we find interesting are those particular ones. I have even written about this (in an article entitled Mathematics, Memory and Mental Arithmetic, which unfortunately appeared in a book and is not available online, but I might try to do something about that at some point).

    This basic point informs a lot of the discussion in the book. Consider, for example, a question that asked you to find the perimeter of a rectangle that had side lengths 2/3 and 3/5. This could be a great question, but it is very important to ask it at the right point in the students’ development. If you ask it before they are fluent at adding fractions and at working out perimeters of rectangles, then the amount they have to hold in their heads may well exceed their cognitive capacity: they need to store the fact that you have to add the two lengths, and multiply by 2, and put both fractions over a common denominator. It is to avoid this kind of strain that attaining fluency is so important: it literally makes it easier to think, and in particular to solve the kind of interesting problems we would all like them to be able to solve. Barton absolutely doesn’t dispute the value of interesting problems that mix different parts of mathematics — he just argues, very convincingly, that one has to be careful when to introduce them.

    An idea he discusses a lot, and that I think might perhaps have a role to play in university-level teaching, is what he calls diagnostic questions, and in particular low-stakes diagnostic tests. These typically take the form of a short multiple-choice quiz, and he tries very hard to create a classroom culture where people understand that the purpose of the quiz is not assessment — the quizzes do not “count” for anything — but a tool to help learning, and in particular to help diagnose problems with understanding.

    What makes these questions “diagnostic” is that they are carefully designed in such a way that if you have a certain misconception, then you will be drawn towards a certain wrong answer. That is, the wrong answers people give are informative for the teacher, rather than merely wrong. Here, for example, is a question that fails to be diagnostic followed by a modified version that succeeds.

    A triangle has one side of length 6 and two sides of length 5. What is its area?

    A. 8
    B. 11
    C. 12
    D. 15
    E. 20

    A. 6
    B. 12
    C. 15
    D. 16
    E. 24
    F. 30

    With the second set of choices, each answer has a potential route that one can imagine somebody taking. To obtain the answer 6, one chops the triangle into two right-angled triangles, each of height 4 and base 3, calculates the area of one of them, and forgets to double it. The correct answer is 12. To obtain 15, one takes the formula “half the height times the base” but substitutes in 5 for the height. To obtain 16 one calculates the perimeter. To obtain the answer 24 one takes the height times the base. And to obtain the answer 30 one multiplies the two numbers 6 and 5 together (on the grounds that “to calculate the area you multiply the two numbers together”). Thus, wrong answers yield useful information. With the first set of answers, that just isn’t the case — they are much more likely to be the result of pure guesswork.

    It’s worth mentioning that Terence Tao has created a number of multiple-choice quizzes on university-level topics. He has also blogged about it here. They are not exactly diagnostic in the sense Barton is talking about, but one could imagine trying to make them so.

    Barton uses these diagnostic tests to get a much clearer picture of what his class already understands, before he launches into the discussion of some new topic, than he would by simply asking questions to the class and getting answers from a few keen students. If he diagnoses a fairly serious collective misunderstanding, then he will spend time dealing with that, rather than pointlessly trying to build on shaky foundations.

    I’m jumping around a bit here, but a semi-counterintuitive idea that he advocates, which is apparently backed up by serious research, is what he calls pretesting. This means testing people on material that they have not yet been taught. As long as this is done carefully, so that it doesn’t put students off completely, this turns out to be very valuable, because it prepares the brain to be receptive to the idea that will help to solve that pesky problem. And indeed, after a moment of getting used to the idea, I found it not counterintuitive at all. In fact, it resonates very strongly with my experience as a research mathematician: I find reading other people’s papers very difficult as a rule, but if they can help me solve a problem I’m working on, a lot of that difficulty seems to melt away, because I know exactly what I want, and am looking out for the key idea that will give it to me.

    There’s a great section on the use of artificial “real-world” problems. I think he would agree with me about Use of Maths A-level. As someone he quotes says, “Students are constantly on their guard against being conned into being interested.” An example he discusses is

    Alan drinks 5/8 of a pint of beer. What fraction of his drink is left?

    If the entire point of the exercise is to gain fluency with subtracting fractions, then he advocates just cutting the crap and asking them to calculate 1-5/8, which I agree with 100%.

    If, on the other hand, it is intended as an exercise in stripping away the unnecessary real-world stuff and getting at the underlying mathematics, then he has interesting things to say (later in the book than this section) about the relationship between what he calls the surface structure and the deep structure. The former is to do with the elements of the question that present themselves directly to the student — in this case Alan and the beer — while the deep structure is more like the underlying mathematical question. To train people to uncover the deep structure, it is very important to give them pairs of questions with the same surface structure and different deep structures, and vice versa. Otherwise, they may learn a procedure that works for lots of similar examples and lets them down as soon as a new example comes along with a different deep structure.

    There is lots more in the book — obviously, given its length — but I hope this conveys some of its flavour. The only negative thing I can think of to say is that the word “flipping” is overused — the sentence “Teaching is flipping hard” occurs several times, when once would be enough for one book. But if you’re ready for a bit of jocularity of that kind, then I recommend it, as I found it highly thought provoking. I don’t yet know what the result of that provocation will be, but I’m pretty sure there will be one.

    http://gowers.wordpress.com/?p=6435
    Extensions
    Taylor and Francis doing Trump’s dirty work for him
    News
    The following story arrived in my email inbox (and those of many others) this morning. Apparently a paper was submitted to the Taylor and Francis journal Dynamical Systems, and was accepted. The published version was prepared, and it had got to the stage where a DOI had been assigned. Then the authorS received a letter […]
    Show full content

    The following story arrived in my email inbox (and those of many others) this morning. Apparently a paper was submitted to the Taylor and Francis journal Dynamical Systems, and was accepted. The published version was prepared, and it had got to the stage where a DOI had been assigned. Then the authorS received a letter explaining that “following internal sanctions process checks” the article could not after all be published because one of them was based in Iran.

    I don’t know what the legal consequences would have been if Taylor and Francis had simply gone ahead and published, but my hunch is that they are being unduly cautious. I wonder if they turned down any papers by Russian authors after the invasion of Ukraine.

    This is not an isolated incident. An Iranian PhD student who applied for funding to go to a mathematics conference in Rome was told that “we are unable to provide financial support for Iranians due to administrative difficulties”.

    I’m not sure what one can do about this, but at the very least it should be generally known that it is happening.

    Update. Taylor and Francis have now reversed their decision.

    http://gowers.wordpress.com/?p=6430
    Extensions
    Worrying news from Turkey
    News
    One should of course be concerned when anybody is detained for spurious reasons, but when that person is a noted mathematician, the shock is greater. Six academics have recently been detained in Turkey, of whom one, Betül Tanbay, is due to become vice president of the European Mathematical Society in January. I do not know […]
    Show full content

    One should of course be concerned when anybody is detained for spurious reasons, but when that person is a noted mathematician, the shock is greater. Six academics have recently been detained in Turkey, of whom one, Betül Tanbay, is due to become vice president of the European Mathematical Society in January. I do not know of any petitions for their release, but if they are not released very quickly I hope that there will be a strong reaction. The EMS has issued the following statement.

    The European Mathematical Society is outraged at the news that the Turkish police have detained, in Istanbul on the morning of 16th November 2018, Professor Betül Tanbay, a member of the EMS Executive Committee. We are incredulous at the subsequent press release from the Istanbul Security Directorate accusing her of links to organized crime and attempts to topple the Turkish government.

    Professor Tanbay is a distinguished mathematician and a Vice President Elect of the European Mathematical Society, due to assume that role from January 2019. We have known her for many years as a talented scientist and teacher, a former President of the Turkish Mathematical Society, an open-minded citizen, and a true democrat. She may not hesitate to exercise her freedom of speech, a lawful right that any decent country guarantees its citizens, but it is preposterous to suggest that she could be involved in violent or criminal activities.

    We demand that Professor Tanbay is immediately freed from detention, and we call on the whole European research community to raise its voice against this shameful mistreatment of our colleague, so frighteningly reminiscent of our continent’s darkest times.

    Update. I have just seen this on Twitter:

    Police freed 8 people, incl. professors Turgut Tarhanli and Betul Tanbay, while barring them from overseas travel, & is still questioning 6 others

    http://gowers.wordpress.com/?p=6425
    Extensions
    A quasirandomness implication
    Straight maths
    This is a bit of a niche post, since its target audience is people who are familiar with quasirandom graphs and like proofs of an analytic flavour. Very roughly, a quasirandom graph is one that behaves like a random graph of the same density. It turns out that there are many ways that one can […]
    Show full content

    This is a bit of a niche post, since its target audience is people who are familiar with quasirandom graphs and like proofs of an analytic flavour. Very roughly, a quasirandom graph is one that behaves like a random graph of the same density. It turns out that there are many ways that one can interpret the phrase “behaves like a random graph” and, less obviously, that they are all in a certain sense equivalent. This realization dates back to seminal papers of Thomason, and of Chung, Graham and Wilson.

    I was lecturing on the topic recently, and proving that certain of the quasirandomness properties all implied each other. In some cases, the proofs are quite a bit easier if you assume that the graph is regular, and in the past I have sometimes made my life easier by dealing just with that case. But that had the unfortunate consequence that when I lectured on Szemerédi’s regularity lemma, I couldn’t just say “Note that the condition on the regular pairs is just saying that they have quasirandomness property n” and have as a consequence all the other quasirandomness properties. So this year I was determined to deal with the general case, and determined to find clean proofs of all the implications. There is one that took me quite a bit of time, but I got there in the end. It is very likely to be out there in the literature somewhere, but I haven’t found it, so it seems suitable for a blog post. I can be sure of at least one interested reader, which is the future me when I find that I’ve forgotten the argument (except that actually I have now found quite a conceptual way of expressing it, so it’s just conceivable that it will stick around in the more accessible part of my long-term memory).

    The implication in question, which I’ll state for bipartite graphs, concerns the following two properties. I’ll state them qualitatively first, and then give more precise versions. Let G be a bipartite graph of density \delta with (finite) vertex sets X and Y.

    Property 1. If A\subset X and B\subset Y, then the number of edges between A and B is roughly \delta|A||B|.

    Property 2. The number of 4-cycles (or more precisely ordered quadruples (x_1,y_1,x_2,y_2) such that x_iy_j is an edge for all four choices of i,j) is roughly \delta^4|X|^2|Y|^2.

    A common way of expressing property 1 is to say that the density of the subgraph induced by A and B is approximately \alpha as long as the sets A and B are not too small. However, the following formulation leads to considerably tidier proofs if one wants to use analytic arguments. I think of G as a function, so G(x,y)=1 if xy is an edge of the graph and 0 otherwise. Then the condition is that if A has density \alpha in X and B has density \beta in Y, then

    |\mathbb E_{x,y}G(x,y)A(x)B(y)-\delta\alpha\beta|\leq c_1

    This condition is interesting when c_1 is small. It might seem more natural to write c_1\alpha\beta on the right-hand side, but then one has to add some condition such as that \alpha,\beta\geq c_1 in order to obtain a condition that follows from the other conditions. If one simply leaves the right-hand side as c_1 (which may depend on \delta), then one obtains a condition that automatically gives a non-trivial statement when A and B are large and a trivial one when they are small.

    As for Property 2, the most natural analytic way of expressing it is the inequality

    \displaystyle \mathop{\mathbb E}_{x_1,x_2\in X}\mathop{\mathbb E}_{y_1,y_2\in Y}G(x_1,y_1)G(x_1,y_2)G(x_2,y_1)G(x_2,y_2)\leq\delta^4(1+c_2)

    An easy Cauchy-Schwarz argument proves a lower bound of \delta^4, so this does indeed imply that the number of labelled 4-cycles is approximately \delta^4|X|^2|Y|^2.

    The equivalence between the two statements is that if one holds for a sufficiently small c_i, then the other holds for a c_j that is as small as you want.

    In fact, both implications are significantly easier in the regular case, but I found a satisfactory way of deducing the first from the second a few years ago and won’t repeat it here, as it requires a few lemmas and some calculation. But it is given as Theorem 5.3 in these notes, and the proof in question can be found by working backwards. (The particular point that is easier in the regular case is Corollary 5.2, because then the function g' mentioned there and in Lemma 5.1 is identically zero.)

    What I want to concentrate on is deducing the second property from the first. Let me first give the proof in the regular case, which is very short and sweet. We do it by proving the contrapositive. So let’s assume that every vertex in X has degree \delta Y, that every vertex in Y has degree \delta X, and that

    \displaystyle \mathop{\mathbb E}_{x_1,x_2\in X}\mathop{\mathbb E}_{y_1,y_2\in Y}G(x_1,y_1)G(x_1,y_2)G(x_2,y_1)G(x_2,y_2)>\delta^4(1+c_2)

    Since for each fixed x_2,y_2, the expectation over x_1,y_1 on the left-hand side is zero if G(x_2,y_2)=0, it follows that there exists some choice of x_2,y_2 such that

    \displaystyle \mathop{\mathbb E}_{x_1\in X}\mathop{\mathbb E}_{y_1\in Y}G(x_1,y_1)G(x_1,y_2)G(x_2,y_1)>\delta^3(1+c_2)

    We now set A=\{x:G(x,y_2)=1\} and B=\{y:G(x_2,y)=1\}. Then A and B both have density \delta (by the regularity assumption), and the above inequality tells us that

    \mathbb E_{x_1,y_1}G(x_1,y_1)A(x_1)B(y_1)-\delta^3>c_2\delta^3,

    so we obtain the desired conclusion with c_1=c_2\delta^3. Or to put it another way, if we assume Property 1 with constant c_1, then we obtain Property 2 with c_2=\delta^{-3}c_1 (which in practice means that c_1 should be small compared with \delta^3 in order to obtain a useful inequality from Property 2).

    The difficulty when G is not regular is that A and B may have densities larger than \delta, and then the inequality

    \mathbb E_{x_1,y_1}G(x_1,y_1)A(x_1)B(y_1)-\delta^3>c_2\delta^3,

    no longer gives us what we need. The way I used to get round this was what I think of as the “disgusting” approach, which goes roughly as follows. Suppose that many vertices in X have degree substantially larger than \delta|Y|, and let A be the set of all such vertices. Then the number of edges between A and Y is too large, and we get the inequality we are looking for (with B=Y). We can say something similar about vertices in Y, so either we are done or G is at least approximately regular, and in particular almost all neighbourhoods are of density not much greater than \delta. Then one runs an approximate version of the simple averaging argument above, arguing in an ugly way that the contribution to the average from “bad” choices (x_2,y_2) is small enough that there must be at least one “good” choice.

    To obtain a cleaner proof, I’ll begin with a non-essential step, but one that I think clarifies what is going on and shows that it’s not just some calculation that magically gives the desired answer. It is to interpret the quantity

    \displaystyle \mathop{\mathbb E}_{x_1,x_2\in X}\mathop{\mathbb E}_{y_1,y_2\in Y}G(x_1,y_1)G(x_1,y_2)G(x_2,y_1)G(x_2,y_2)

    as the probability that the four pairs x_iy_j are all edges of G if x_1,x_2 are chosen independently at random from X and y_1,y_2 are chosen independently at random from Y. Then our hypothesis is that

    \mathbb P[all x_iy_j are edges]-\delta^4>c_2\delta^4

    Let us now split up the left-hand side as follows. It is the sum of

    \mathbb P[all x_iy_j are edges]
    -\delta\mathbb P[x_1y_2,x_2y_1,x_2y_2 are edges]

    and

    \delta\mathbb P[x_1y_1,x_1y_2,x_2y_1 are edges]
    -\delta^2\mathbb P[x_1y_2,x_2y_1 are edges],

    where I have used the fact that the probability that x_1y_2 and x_2y_1 are both edges is exactly \delta^2.

    One or other of these two terms must be at least c_2\delta^4/2. If it is the first, then averaging gives us x_2,y_2 such that

    \mathbb E_{x_1,y_1}(G(x_1,y_1)-\delta)G(x_1,y_2)G(x_2,y_1)\geq c_2\delta^3/2

    and now we can define A and B just as in the regular case, obtaining the same result apart from an extra factor of 1/2.

    If it is the second, then averaging over x_1,y_1 this time and dividing by \delta gives us some x_1,y_1 such that the inequality

    \mathbb E_{x_2,y_2}G(x_1,y_2)G(x_2,y_1)(G(x_2,y_2)-\delta)\geq c_2\delta^3/2.

    So again we are done, this time with the roles of 1 and 2 reversed.

    I haven’t checked the details, but it is clear that one could run this argument for any subgraph H that occurs the “wrong” number of times. The extra factor one would need to divide by would be the number of edges you need to remove from H to obtain a matching.

    Of course, the fact that Property 1 implies that all small subgraphs occur with roughly the right frequency is a very well known fact about quasirandom graphs, but it is usually proved using an almost regularity assumption, which I wanted to avoid.

    http://gowers.wordpress.com/?p=6410
    Extensions
    Additional thoughts on the Ted Hill paper
    News
    First, I’d like to thank the large number of commenters on my previous post for keeping the discussion surprisingly calm and respectful given the topic discussed. In that spirit, and to try to practise the scientific integrity that I claimed to care about, I want to acknowledge that my views about the paper have changed […]
    Show full content

    First, I’d like to thank the large number of commenters on my previous post for keeping the discussion surprisingly calm and respectful given the topic discussed. In that spirit, and to try to practise the scientific integrity that I claimed to care about, I want to acknowledge that my views about the paper have changed somewhat as a result of the discussion. My understanding of the story of what happened to the paper has changed even more now that some of those attacked in Ted Hill’s Quillette article have responded, but about that I only want to repeat what I said in one or two comments on the previous post: that my personal view is that one should not “unaccept” or “unpublish” a paper unless something was improper about the way it was accepted or published, and that that is also the view of the people who were alleged to have tried to suppress Ted Hill’s paper on political grounds. I would also remark that whatever happened at NYJM would not have happened if all decisions had to be taken collectively by the whole editorial board, which is the policy on several journals I have been on the board of. According to Igor Rivin, the policy at NYJM is very different: “No approval for the full board is required, or ever obtained. The approval of the Editor in Chief is not required.” I find this quite extraordinary: it would seem to be a basic safeguard that decisions should be taken by more than one person — ideally many more.

    To return to the paper, I now see that the selectivity hypothesis, which I said I found implausible, was actually quite reasonable. If you look carefully at my previous post, you will see that I actually started to realize that even when writing it, and it would have been more sensible to omit that criticism entirely, but by the time it occurred to me that ancient human females could well have been selective in a way that could (in a toy model) be reasonably approximated by Hill’s hypothesis, I had become too wedded to what I had already written — a basic writer’s mistake, made in this case partly because I had only a short window of time in which to write the post. I’m actually quite glad I left the criticism in, since I learnt quite a lot from the numerous comments that defended the hypothesis.

    I had a similar experience with a second criticism: the idea of dividing the population up into two subpopulations. That still bothers me somewhat, since in reality we all have large numbers of genes that interact in complicated ways and it is not clear that a one-dimensional model will be appropriate for a high-dimensional feature space. But perhaps for a toy model intended to start a discussion that is all right.

    While I’m at it, some commenters on the previous post came away with the impression that I was against toy models. I agree with the following words, which appeared in a book that was published in 2002.

    There are many ways of modelling a given physical situation and we must use a mixture of experience and further theoretical considerations to decide what a given model is likely to teach us about the world itself. When choosing a model, one priority is to make its behaviour correspond closely to the actual, observed behaviour of the world. However, other factors, such as simplicity and mathematical elegance, can often be more important. Indeed, there are very useful models with almost no resemblance to the world at all …

    But that’s not surprising, since I was the author of the book.

    But there is a third feature of Hill’s model that I still find puzzling. Some people have tried to justify it to me, but I found that either I understood the justifications and found them unconvincing or I didn’t understand them. I don’t rule out the possibility that some of the ones I didn’t understand were reasonable defences of this aspect of the model, but let me lay out once again the difficulty I have.

    To do this I’ll briefly recall Hill’s model. You have two subpopulations P and Q of, let us say, the males of a species. (It is not important for the model that they are male, but that is how Hill hopes the model will be applied.) The distribution of desirability of subpopulation P is more spread out than that of subpopulation Q, so if the females of the species choose to reproduce only with males above a rather high percentile of desirability, they will pick a greater proportion of subpopulation P than of subpopulation Q.

    A quick aside is that what I have just written is more or less the entire actual content (as opposed to surrounding discussion) of Hill’s paper. Of course, he has to give a precise definition of “more spread out”, but it is very easy to come up with a definition that will give the desired conclusion after a one-line argument, and that is what he does. He also gives a continuous-time version of the process. But I’m not sure what adding a bit of mathematical window dressing really adds, since the argument in the previous paragraph is easy to understand and obviously correct. But of course without that window dressing the essay couldn’t hope to sell itself as a mathematics paper.

    The curious feature of the model, and the one that I still find hard to accept, is that Hill assumes, and absolutely needs to assume, that the only thing that can change is the sizes of the subpopulations and not the distributions of desirability within those populations. So if, for example, what makes a male desirable is height, and if the average heights in the two populations are the same, then even though females refuse to reproduce with anybody who isn’t unusually tall, the average height of males remains the same.

    The only way this strange consequence can work, as far as I can see, is if instead of there being a gene (or combination of genes) that makes men tall, there is a gene that has some complicated effect of which a side-effect is that the height of men is more variable, and moreover there aren’t other genes that simply cause tallness.

    It is hard to imagine what the complicated effect might be in the case of height, but it is not impossible to come up with speculations about mathematical ability. For example, maybe men have, as has been suggested, a tendency to be a bit further along the autism spectrum than women, which causes some of them to become very good at mathematics and others to lack the social skills to attract a mate. But even by the standards of evolutionary just-so stories, that is not a very good one. Our prehistoric ancestors were not doing higher mathematics, so we would need to think of some way that being on the spectrum could have caused a man at that time to become highly attractive to women. One has to go through such contortions to make the story work, when all along there is the much more straightforward possibility that there is some complex mix of genes that go towards making somebody intelligent, and that if prehistoric women went for intelligent men, then those genes would be selected for. But if that is what happened, then the proportion of less intelligent men would go down, and therefore the variability would go down.

    While writing this, I have realized that there is a crucial assumption of Hill’s, the importance of which I had not appreciated. It’s that the medians of his two subpopulations are the same. Suppose instead that the individuals in male population P are on average more desirable than the individuals in male population Q. Then even if population P is less variable than population Q, if females are selective, it may very well be that a far higher proportion of population P is chosen than of population Q, and therefore a tendency for the variability of the combined population to decrease. In fact, we don’t even need to assume that P is less variable than Q: if the population as a whole becomes dominated by P, it may well be less variable than the original combination of populations P and Q.

    So for Hill’s model to work, it needs a fairly strange and unintuitive combination of hypotheses. Therefore, if he proposes it as a potential explanation for greater variability amongst males, he needs to argue that this combination of hypotheses might actually have occurred for many important features. For example, if it is to explain greater variability for males in mathematics test scores, then he appears to need to argue (i) that there was a gene that made our prehistoric male ancestors more variable with respect to some property that at one end of the scale made them more desirable to females, (ii) that this gene had no effect on average levels of desirability, (iii) that today this curious property has as a side-effect greater variability in mathematics test scores, and (iv) this tendency to increase variability is not outweighed by reduction of variability due to selection of other genes that do affect average levels. (Although he explicitly says that he is not trying to explain any particular instance of greater variability amongst males, most of the references he gives concerning such variability are to do with intellectual ability, and if he can’t give a convincing story about that, then why have all those references?)

    Thus, what I object to is not the very idea of a toy model, but more that with this particular toy model I have to make a number of what seem to me to be highly implausible assumptions to get it to work. And I don’t mean the usual kind of entirely legitimate simplifying assumptions. Rather, I’m talking about artificial assumptions that seem to be there only to get the model to do what Hill wants it to do. If some of the hypotheses above that seem implausible to me have in fact been observed by biologists, it seems to me that Hill should have included references to the relevant literature in his copious bibliography.

    As with my previous post, I am not assuming that everything I’ve just written is right, and will be happy to be challenged on the points above.

    http://gowers.wordpress.com/?p=6405
    Extensions
    Has an uncomfortable truth been suppressed?
    News
    Update to post, added 11th September. As expected, there is another side to the story discussed below. See this statement about the decision by the Mathematical Intelligencer and this one about the decision taken by the New York Journal of Mathematics. Further update, added 15th September. The author has also made a statement. I was […]
    Show full content

    Update to post, added 11th September. As expected, there is another side to the story discussed below. See this statement about the decision by the Mathematical Intelligencer and this one about the decision taken by the New York Journal of Mathematics.

    Further update, added 15th September. The author has also made a statement.


    I was disturbed recently by reading about an incident in which a paper was accepted by the Mathematical Intelligencer and then rejected, after which it was accepted and published online by the New York Journal of Mathematics, where it lasted for three days before disappearing and being replaced by another paper of the same length. The reason for this bizarre sequence of events? The paper concerned the “variability hypothesis”, the idea, apparently backed up by a lot of evidence, that there is a strong tendency for traits that can be measured on a numerical scale to show more variability amongst males than amongst females. I do not know anything about the quality of this evidence, other than that there are many papers that claim to observe greater variation amongst males of one trait or another, so that if you want to make a claim along the lines of “you typically see more males both at the top and the bottom of the scale” then you can back it up with a long list of citations.

    You can see, or probably already know, where this is going: some people like to claim that the reason that women are underrepresented at the top of many fields is simply that the top (and bottom) people, for biological reasons, tend to be male. There is a whole narrative, much loved by many on the political right, that says that this is an uncomfortable truth that liberals find so difficult to accept that they will do anything to suppress it. There is also a counter-narrative that says that people on the far right keep on trying to push discredited claims about the genetic basis for intelligence, differences amongst various groups, and so on, in order to claim that disadvantaged groups are innately disadvantaged rather than disadvantaged by external circumstances.

    I myself, as will be obvious, incline towards the liberal side, but I also care about scientific integrity, so I felt I couldn’t just assume that the paper in question had been rightly suppressed. I read an article by the author that described the whole story (in Quillette, which rather specializes in this kind of story), and it sounded rather shocking, though one has to bear in mind that since the article is written by a disgruntled author, there is almost certainly another side to the story. In particular, he is at pains to stress that the paper is simply a mathematical theory to explain why one sex might evolve to become more variable than another, and not a claim that the theory applies to any given species or trait. In his words, “Darwin had also raised the question of why males in many species might have evolved to be more variable than females, and when I learned that the answer to his question remained elusive, I set out to look for a scientific explanation. My aim was not to prove or disprove that the hypothesis applies to human intelligence or to any other specific traits or species, but simply to discover a logical reason that could help explain how gender differences in variability might naturally arise in the same species.”

    So as I understood the situation, the paper made no claims whatsoever about the real world, but simply defined a mathematical model and proved that in this model there would be a tendency for greater variability to evolve in one sex. Suppressing such a paper appeared to make no sense at all, since one could always question whether the model was realistic. Furthermore, suppressing papers on this kind of topic simply plays into the hands of those who claim that liberals are against free speech, that science is not after all objective, and so on, claims that are widely believed and do a lot of damage.

    I was therefore prompted to look at the paper itself, which is on the arXiv, and there I was met by a surprise. I was worried that I would find it convincing, but in fact I found it so unconvincing that I think it was a bad mistake by Mathematical Intelligencer and the New York Journal of Mathematics to accept it, but for reasons of mathematical quality rather than for any controversy that might arise from it. To put that point more directly, if somebody came up with a plausible model (I don’t insist that it should be clearly correct) and showed that subject to certain assumptions about males and females one would expect greater variability to evolve amongst males, then that might well be interesting enough to publish, and certainly shouldn’t be suppressed just because it might be uncomfortable, though for all sorts of reasons that I’ll discuss briefly later, I don’t think it would be as uncomfortable as all that. But this paper appears to me to fall well short of that standard.

    To justify this view, let me try to describe what the paper does. Its argument can be summarized as follows.

    1. Because in many species females have to spend a lot more time nurturing their offspring than males, they have more reason to be very careful when choosing a mate, since a bad choice will have more significant consequences.

    2. If one sex is more selective than the other, then the less selective sex will tend to become more variable.

    To make that work, one must of course define some kind of probabilistic model in which the words “selective” and “variable” have precise mathematical definitions. What might one expect these to be? If I hadn’t looked at the paper, I think I’d have gone for something like this. An individual A of one sex will try to choose as desirable a mate B as possible amongst potential mates that would be ready to accept A as a mate. To be more selective would simply mean to make more of an effort to optimize the mate, which one would model in some suitable probabilistic way. One feature of this model would presumably be that a less attractive individual would typically be able to attract less desirable mates.

    I won’t discuss how variability is defined, except to say that the definition is, as far as I can see, reasonable. (For normal distributions it agrees with standard deviation.)

    The definition of selectivity in the paper is extremely crude. The model is that individuals of one sex will mate with individuals of the other sex if and only if they are above a certain percentile in the desirability scale, a percentile that is the same for everybody. For instance, they might only be prepared to choose a mate who is in the top quarter, or the top two thirds. The higher the percentile they insist on, the more selective that sex is.

    When applied to humans, this model is ludicrously implausible. While it is true that some males have trouble finding a mate, the idea that some huge percentage of males are simply not desirable enough (as we shall see, the paper requires this percentage to be over 50) to have a chance of reproducing bears no relation to the world as we know it.

    I suppose it is just about possible that an assumption like this could be true of some species, or even of our cave-dwelling ancestors — perhaps men were prepared to shag pretty well anybody, but only some small percentage of particularly hunky men got their way with women — but that isn’t the end of what I find dubious about the paper. And even if we were to accept that something like that had been the case, it would be a huge further leap to assume that what made somebody desirable hundreds of thousands of years ago was significantly related to what makes somebody good at, say, mathematical research today.

    Here is one of the main theorems of the paper, with a sketch of the proof. Suppose you have two subpopulations P and Q within one of the two sexes, with P being of more varied attractiveness than Q. And suppose that the selectivity cutoff for the other sex is that you have to be in the top 40 percent attractiveness-wise. Then because P is more concentrated on the extremes than Q, a higher proportion of subpopulation P will be in that percentile. (This can easily be made rigorous using the notion of variability in the paper.) By contrast, if the selectivity cutoff is that you have to be in the top 60 percent, then a higher proportion of subpopulation Q will be chosen.

    I think we are supposed to conclude that subpopulation P is therefore favoured over subpopulation Q when the other sex is selective, and not otherwise, and therefore that variability amongst males tends to be selected for, because females tend to be more choosy about their mates.

    But there is something very odd about this. Those poor individuals at the bottom of population P aren’t going to reproduce, so won’t they die out and potentially cause population P to become less variable? Here’s what the paper has to say.

    Thus, in this discrete-time setting, if one sex remains selective from each generation to the next, for example, then in each successive generation more variable subpopulations of the opposite sex will prevail over less variable subpopulations with comparable average desirability. Although the desirability distributions themselves may evolve, if greater variability prevails at each step, that suggests that over time the opposite sex will tend toward greater variability.

    Well I’m afraid that to me it doesn’t suggest anything of the kind. If females have a higher cutoff than males, wouldn’t that suggest that males would have a much higher selection pressure to become more desirable than females? And wouldn’t the loss of all those undesirable males mean that there wasn’t much one could say about variability? Imagine for example if the individuals in P were all either extremely fit or extremely unfit. Surely the variability would go right down if only the fit individuals got to reproduce. And if you’re worrying that the model would in fact show that males would tend to become far superior to females, as opposed to the usual claim that males are more spread out both at the top and at the bottom, let’s remember that males inherit traits from both their fathers and their mothers, as do females, an observation that, surprisingly, plays no role at all in the paper.

    What is the purpose of the strange idea of splitting into two subpopulations and then ignoring the fact that the distributions may evolve (and why just “may” — surely “will” would be more appropriate)? Perhaps the idea is that a typical gene (or combination of genes) gives rise not to qualities such as strength or intelligence, but to more obscure features that express themselves unpredictably — they don’t necessarily make you stronger, for instance, but they give you a bigger range of strength possibilities. But is there the slightest evidence for such a hypothesis? If not, then why not just consider the population as a whole? My guess is that you just don’t get the desired conclusion if you do that.

    I admit that I have not spent as long thinking about the paper as I would need to in order to be 100% confident of my criticisms. I am also far from expert in evolutionary biology and may therefore have committed some rookie errors in what I have written above. So I’m prepared to change my mind if somebody (perhaps the author?) can explain why the criticisms are invalid. But as it looks to me at the time of writing, the paper isn’t a convincing model, and even if one accepts the model, the conclusion drawn from the main theorem is not properly established. Apparently the paper had a very positive referee’s report. The only explanation I can think of for that is that it was written by somebody who worked in evolutionary biology, didn’t really understand mathematics, and was simply pleased to have what looked like a rigorous mathematical backing for their theories. But that is pure speculation on my part and could be wrong.

    I said earlier that I don’t think one should be so afraid of the genetic variability hypothesis that one feels obliged to dismiss all the literature that claims to have observed greater variability amongst males. For all I know it is seriously flawed, but I don’t want to have to rely on that in order to cling desperately to my liberal values.

    So let’s just suppose that it really is the case that amongst a large number of important traits, males and females have similar averages but males appear more at the extremes of the distribution. Would that help to explain the fact that, for example, the proportion of women decreases as one moves up the university hierarchy in mathematics, as Larry Summers once caused huge controversy by suggesting? (It’s worth looking him up on Wikipedia to read his exact words, which are more tentative than I had realized.)

    The theory might appear to fit the facts quite well: if men and women are both normally distributed with the same mean but men have a greater variance than women, then a randomly selected individual from the top x percent of the population will be more and more likely to be male the smaller x gets. That’s just simple mathematics.

    But it is nothing like enough reason to declare the theory correct. For one thing, it is just as easy to come up with an environmental theory that would make a similar prediction. Let us suppose that the way society is organized makes it harder for women to become successful mathematicians than for men. There are all sorts of reasons to believe that this is the case: relative lack of role models, an expectation that mathematics is a masculine pursuit, more disruption from family life (on average), distressing behaviour by certain male colleagues, and so on. Let’s suppose that the result of all these factors is that the distribution of whatever it takes for women to make a success of mathematics has a slightly lower mean than that for men, but roughly the same variance, with both distributions normal. Then again one finds by very basic mathematics that if one picks a random individual from the top x percent, that individual will be more and more likely to be male as x gets smaller. But in this case, instead of throwing up our hands and saying that we can’t fight against biology, we will say that we should do everything we can to compensate for and eventually get rid of the disadvantages experienced by women.

    A second reason to be sceptical of the theory is that it depends on the idea that how good one is at mathematics is a question of raw brainpower. But that is a damaging myth that puts many people off doing mathematics who could have enjoyed it and thrived at it. I have often come across students who astound me with their ability to solve problems far more quickly than I can, (not all of them male). Some of them go on to be extremely successful mathematicians, but not all. And some who seem quite ordinary go on to do extraordinary things later on. It is clear that while an unusual level of raw brainpower, whatever that might be, often helps, it is far from necessary and far from sufficient for becoming a successful mathematician: it is part of a mix that includes dedication, hard work, enthusiasm, and often a big slice of luck. And as one gains in experience, one gains in brainpower — not raw any more, but who cares whether it is hardware or software? So even if it turned out that the genetic variability hypothesis was correct and could be applied to something called raw mathematical brainpower, a conclusion that would be very hard to establish convincingly (it’s certainly not enough to point out that males find it easier to visualize rotating 3D objects in their heads), that still wouldn’t imply that it is pointless to try to correct the underrepresentation of women amongst the higher ranks of mathematicians. When I was a child, almost all doctors and lawyers were men, and during my lifetime I have seen that change completely. The gender imbalance amongst mathematicians has changed more slowly, but there is no reason in principle that the pace couldn’t pick up substantially. I hope to live to see that happen.

    http://gowers.wordpress.com/?p=6390
    Extensions
    A new journal in combinatorics
    Mathematics on the internetNews
    This post is to announce that a new journal, Advances in Combinatorics, has just opened for submissions. I shall also say a little about the journal, about other new journals, about my own experiences of finding journals I am happy to submit to, and about whether we are any nearer a change to more sensible […]
    Show full content

    This post is to announce that a new journal, Advances in Combinatorics, has just opened for submissions. I shall also say a little about the journal, about other new journals, about my own experiences of finding journals I am happy to submit to, and about whether we are any nearer a change to more sensible systems of dissemination and evaluation of scientific papers.

    Advances in Combinatorics

    Advances in Combinatorics is set up as a combinatorics journal for high-quality papers, principally in the less algebraic parts of combinatorics. It will be an arXiv overlay journal, so free to read, and it will not charge authors. Like its cousin Discrete Analysis (which has recently published its 50th paper) it will be run on the Scholastica platform. Its minimal costs are being paid for by the library at Queen’s University in Ontario, which is also providing administrative support. The journal will start with a small editorial board. Apart from me, it will consist of Béla Bollobás, Reinhard Diestel, Dan Kral, Daniela Kühn, James Oxley, Bruce Reed, Gabor Sarkozy, Asaf Shapira and Robin Thomas. Initially, Dan Kral and I will be the managing editors, though I hope to find somebody to replace me in that role once the journal is established. While I am posting this, Dan is simultaneously announcing the journal at the SIAM conference in Discrete Mathematics, where he has just given a plenary lecture. The journal is also being announced by COAR, the Confederation of Open Access Repositories. This project aligned well with what they are trying to do, and it was their director, Kathleen Shearer, who put me in touch with the library at Queen’s.

    As with Discrete Analysis, all members of the editorial board will be expected to work: they won’t just be lending their names to give the journal bogus prestige. Each paper will be handled by one of the editors, who, after obtaining external opinions (when the paper warrants them) will make a recommendation to the rest of the board. All decisions will be made collectively. The job of the managing editors will be to make sure that this process runs smoothly, but when it comes to decisions, they will have no more say than any other editor.

    The rough level that the journal is aiming at is that of a top specialist journal such as Combinatorica. The reason for setting it up is that there is a gap in the market for an “ethical” combinatorics journal at that level — that is, one that is not published by one of the major commercial publishers, with all the well known problems that result. We are not trying to destroy the commercial combinatorial journals, but merely to give people the option of avoiding them if they would prefer to submit to a journal that is not complicit in a system that uses its monopoly power to ruthlessly squeeze library budgets.

    We are not the first ethical journal in combinatorics. Another example is The Electronic Journal of Combinatorics, which was set up by Herb Wilf back in 1994. The main difference between EJC and Advances in Combinatorics is that we plan to set a higher bar for acceptance, even if it means that we accept only a small number of papers. (One of the great advantages of a fully electronic journal is that we do not have a fixed number of issues per year, so we will not have to change our standards artificially in order to fill issues or clear backlogs.) We thus hope that EJC and AIC will between them offer suitable potential homes for a wide range of combinatorics papers. And on the more algebraic side, one should also mention Algebraic Combinatorics, which used to be the Springer journal The Journal of Algebraic Combinatorics (which officially continues with an entirely replaced editorial board — I don’t know whether it’s getting many submissions though), and also the Australasian Journal of Combinatorics.

    So if you’re a combinatorialist who is writing up a result that you think is pretty good, then please consider submitting it to us. What do we mean by “pretty good”? My personal view — that is, I am not speaking for the rest of the editorial board — is that the work in a good paper should have a clear reason for others to be interested in it (so not, for example, incremental progress in some pet project of the author) and should have something about it that makes it count as a significant achievement, such as solving a well-known problem, clearing a difficult technical hurdle, inventing a new and potentially useful technique, or giving a beautiful and memorable proof.

    What other ethical journals are there?

    Suppose that you want to submit an article to a journal that is free to read and does not charge authors. What are your options? I don’t have a full answer to this question, so I would very much welcome feedback from other people, especially in areas of mathematics far from my own, about what the options are for them. But a good starting point is to consult the list of current member journals in the Free Journal Network, which Advances in Combinatorics hopes to join in due course.

    Three notable journals not on that list are the following.

    1. Acta Mathematica. This is one of a tiny handful of the very top journals in mathematics. Last year it became fully open access without charging author fees. So for a really good paper it is a great option.
    2. Annales Henri Lebesgue. This is a new journal that has not yet published any articles, but is open for submissions. Like Acta Mathematica, it covers all of mathematics. It aims for a very high standard, but it is not yet clear what that means in practice: I cannot say that it will be roughly at the level of Journal X. But perhaps it will turn out to be suitable for a very good paper that is just short of the level of Annals, Acta, or JAMS.
    3. Algebra and Number Theory. I am told that this is regarded as the top specialist journal in number theory. From a glance at the article titles, I don’t see much analytic number theory, but there are notable analytic number theorists on the editorial board, so perhaps I have just not looked hard enough.

    Added later: I learn from Benoît Kloeckner and Emmanuel Kowalski in the comments below that my information about Algebra and Number Theory was wrong, since articles in that journal are not free to read until they are five years old. However, it is published by MSP, which is a nonprofit organization, so as subscription journals go it is at the ethical end of the spectrum.

    Further update: I have heard from the editors of Annales Henri Lebesgue that they have had a number of strong submissions and expect the level of the journal to be at least as high as that of journals such as Advances in Mathematics, Mathematische Annalen and the Israel Journal of Mathematics, and perhaps even slightly higher.

    In what areas are ethical journals most needed?

    I would very much like to hear from people who would prefer to avoid the commercially published journals, but can’t, because there are no ethical journals of a comparable standard in their area. I hope that combinatorialists will no longer have that problem. My impression is that there is a lack of suitable journals in analysis and I’m told that the same is true of logic. I’m not quite sure what the situation is in geometry or algebra. (In particular, I don’t know whether Algebra and Number Theory is also considered as the top specialist journal for algebraists.) Perhaps in some areas there are satisfactory choices for papers of some standards but not of others: that too would be interesting to know. Where do you think the gaps are? Let me know in the comments below.

    Starting a new journal.

    I want to make one point loud and clear, which is that the mechanics of starting a new, academic-run journal are now very easy. Basically, the only significant obstacle is getting together an editorial board with the right combination of reputation in the field and willingness to work. What’s more, unless the journal grows large, the work is quite manageable — all the more so if it is spread reasonably uniformly amongst the editorial board. Creating the journal itself can be done on one of a number of different platforms, either for no charge or for a very small charge. Some examples are the Mersenne platform, which hosts the Annales Henri Lebesgue, the Episciences platform, which hosts the Epijournal de Géométrie Algébrique, and Scholastica, which, as I mentioned above, hosts Discrete Analysis and Advances in Combinatorics.

    Of these, Scholastica charges a submission fee of $10 per article and the other two are free. There are a few additional costs — for example, Discrete Analysis pays a subscription to CrossRef in order to give DOIs to articles — but the total cost of running a new journal that isn’t too large is of the order of a few hundred dollars per year, as long as nobody is paid for what they do. (Discrete Analysis, like Advances in Combinatorics, gets very useful assistance from librarians, provided voluntarily, but even if they were paid the going rate, the total annual costs would be of the same order of magnitude as one “article processing charge” of the traditional publishers, which is typically around $1500 per article.)

    What’s more, those few hundred dollars are not an obstacle either. For example, I know of a fund that is ready to support at least one other journal of a similar size to Discrete Analysis, there are almost certainly other libraries that would be interested in following the enlightened example of Queen’s University Library and supporting a journal (if you are a librarian reading this, then I strongly recommend doing so, as it will be helping to weaken the hold of the system that is currently costing you orders of magnitude more money), and I know various people who know about other means of obtaining funding. So if you are interested in starting a journal and think you can put together a credible editorial board, then get in touch: I can offer advice, funding (if the proposal looks a good one), and contact with several other people who are knowledgeable and keen to help.

    A few remarks about my own relationship with mathematical publishing.

    My attitudes to journals and the journal system have evolved quite a lot in the last few years. The alert reader may have noticed that I’ve got a long way through this post before mentioning the E-word. I still think that Elsevier is the publisher that does most damage, and have stuck rigidly to my promise made over six years ago not to submit a paper to them or to do editorial or refereeing work. However, whereas then I thought of Springer as somehow more friendly to mathematics, thanks to its long tradition of publishing important textbooks and monographs, I now feel pretty uncomfortable about all the big four — Elsevier, Springer, Wiley, and Taylor and Francis — with Springer having got a whole lot worse after merging with Nature Macmillan. And in some respects Elsevier is better than Springer: for example, they make all mathematics papers over four years old freely available, while Springer refuses to do so. Admittedly this was basically a sop to mathematicians to keep us quiet, but as sops go it was a pretty good one, and I see now that Elsevier’s open archive, as they call it, includes some serious non-mathematical journals such as Cell. (See their list of participating journals for details.)

    I’m also not very comfortable with the society journals and university presses, since although they use their profits to benefit mathematics in various ways, they are fully complicit in the system of big deals, the harm of which outweighs those benefits.

    The result is that if I have a paper to submit, I tend to have a lot of trouble finding a suitable home for it, and I end up having to compromise on my principles to some extent (particularly if, as happened recently, I have a young coauthor from a country that uses journal rankings to evaluate academics). An obvious place to submit to would be Discrete Analysis, but I feel uncomfortable about that for a different reason, especially now that I have discovered that the facility that enables all the discussion of a paper to be hidden from selected editors does not allow me, as administrator of the website, to hide a paper from myself. (I won’t have this last problem with Advances in Combinatorics, since the librarians at Queens will have the administrator role on the system.)

    So my personal options are somewhat limited, but getting better. If I have willing coauthors, then I would now consider (if I had a suitable paper), Acta Mathematica, Annales Henri Lebesgue, Journal de l’École Polytechnique, Discrete Analysis perhaps (but only if the other editors agreed to process my paper offline), Advances in Combinatorics, the Theory of Computing, Electronic Research Announcements in the Mathematical Sciences, the Electronic Journal of Combinatorics, and the Online Journal of Analytic Combinatorics. I also wouldn’t rule out Forum of Mathematics. A couple of journals to which I have an emotional attachment even if I don’t really approve of their practices are GAFA and Combinatorics, Probability and Computing. (The latter bothers me because it is a hybrid journal — that is, it charges subscriptions but also lets authors pay large APCs to make their articles open access, and I heard recently that if you choose the open access option, CUP retains copyright, so you’re not getting that much for your money. But I think not many authors choose this option. The former is also a hybrid journal, and is published by Springer.) Annals of Mathematics, if I’m lucky enough to have an Annals-worthy paper (though I think now I’d try Acta first), is not too bad — although its articles aren’t open access, their subscription costs are much more reasonable than most journals.

    That’s a list off the top of my head: if you think I’ve missed out a good option, then I’d be very happy to hear about it.

    As an editor, I have recently made the decision that I want to devote all my energies to promoting journals and “post-journal” systems that I fully approve of. So in order to make time for the work that will be involved in establishing Advances in Combinatorics, I have given notice to Forum of Mathematics and Mathematika, the two journals that took up the most of my time, that I will leave their editorial boards at the end of 2018. I feel quite sad about Forum of Mathematics, since I was involved in it from the start, and I really like the way it runs, with proper discussions amongst all the editors about the decisions we make. Also, I am less hostile (for reasons I’ve given in the past) to its APC model than most mathematicians. However, although I am less hostile, I could never say that I have positively liked it, and I came to the conclusion quite a while ago that, as many others have also said, it simply can’t be made to work satisfactorily: it will lead to just as bad market abuses as there are with the subscription system. In the UK it has been a disaster — government open-access mandates have led to universities paying as much as ever for subscriptions and then a whole lot extra for APCs. And there is a real worry that subscription big deals will be replaced by APC big deals, where a country pays a huge amount up front to a publisher in return for people from that country being able to publish with them. This, for example, is what Germany is pushing for. Fortunately, for the moment (if I understand correctly, though I don’t have good insider information on this) they are asking for the average fee per article to be much lower than Elsevier is prepared to accept: long may that impasse continue.

    So my leaving Forum of Mathematics is not a protest against it, but simply a practical step that will allow me to focus my energies where I think they can do the most good. I haven’t yet decided whether I ought to resign in protest from some other editorial boards of journals that don’t ask anything of me. Actually, even the practice of having a long list of names of editors, most of whom have zero involvement in the decisions of the journal, is one that bothers me. I recently heard of an Elsevier journal where almost all the editorial board would be happy to resign en masse and set up an ethical version, but the managing editor is strongly against. “But why don’t the rest of the board resign in that case?” I naively asked, to which the answer was, “Because he’s the one who does all the work!” From what I understood, this is literally true — the managing editor handles all the papers and makes all the decisions — but I’m not 100% sure about that.

    Is there any point in starting new journals?

    Probably major change, if it happens, will be the result of decisions made by major players such as government agencies, national negotiators, and so on. Compared with big events like the Elsevier negotiations in Germany, founding a new journal is a very small step. And even if all mathematicians gave up using the commercial publishers (not something I expect to see any time soon), that would have almost no direct effect, since mathematics journals are bundled together with journals in other subjects, which would continue with the current system.

    However, this is a familiar situation in politics. Big decisions are taken by people in positions of power, but what prompts them to make those decisions is often the result of changes in attitudes and behaviour of voters. And big behavioural changes do happen in academia. For example, as we all know, many people have got into the habit of posting all their work on the arXiv, and this accumulation of individual decisions has had the effect of completely changing the way dissemination works in some subjects, including mathematics, a change that has significantly weakened the hold that journals have — or would have if they weren’t bundled together with other journals. Who would ever subscribe at vast expense to a mathematics journal when almost all its content is available online in preprint form?

    So I see Advances in Combinatorics as a small step certainly, but a step that needs to be taken. I hope that it will demonstrate once again that starting a serious new journal is not that hard. I also hope that the current trickle of such journals will turn into a flood, that after the flood it will not be possible for people to argue that they are forced to submit articles to the commercial publishers, and that at some point, someone in a position of power will see what is going on, understand better the absurdities of the current system, and take a decision that benefits us all.

    http://gowers.wordpress.com/?p=6369
    Extensions
    Two infinities that are surprisingly equal
    News
    It has been in the news recently — or rather, the small corner of the news that is of particular interest to mathematicians — that Maryanthe Malliaris and Saharon Shelah recently had an unexpected breakthrough when they stumbled on a proof that two infinities were equal that had been conjectured, and widely believed, to be […]
    Show full content

    It has been in the news recently — or rather, the small corner of the news that is of particular interest to mathematicians — that Maryanthe Malliaris and Saharon Shelah recently had an unexpected breakthrough when they stumbled on a proof that two infinities were equal that had been conjectured, and widely believed, to be distinct. Or rather, since both were strictly between the cardinality of the natural numbers and the cardinality of the reals, they were widely believed to be distinct in some models of set theory where the continuum hypothesis fails.

    A couple of days ago, John Baez was sufficiently irritated by a Quanta article on this development that he wrote a post on Google Plus in which he did a much better job of explaining what was going on. As a result of reading that, and following and participating in the ensuing discussion, I have got interested in the problem. In particular, as a complete non-expert, I am struck that a problem that looks purely combinatorial (though infinitary) should, according to Quanta, have a solution that involves highly non-trivial arguments in proof theory and model theory. It makes me wonder, again as a complete non-expert so probably very naively, whether there is a simpler purely combinatorial argument that the set theorists missed because they believed too strongly that the two infinities were different.

    I certainly haven’t found such an argument, but I thought it might be worth at least setting out the problem, in case it appeals to anyone, and giving a few preliminary thoughts about it. I’m not expecting much from this, but if there’s a small chance that it leads to a fruitful mathematical discussion, then it’s worth doing. As I said above, I am indebted to John Baez and to several commenters on his post for being able to write much of what I write in this post, as can easily be checked if you read that discussion as well.

    A few definitions and a statement of the result

    The problem concerns the structure you obtain when you take the power set of the natural numbers and quotient out by the relation “has a finite symmetric difference with”. That is, we regard two sets A and B as equivalent if you can turn A into B by removing finitely many elements and adding finitely many other elements.

    It’s easy to check that this is an equivalence relation. We can also define a number of the usual set-theoretic operations. For example, writing [A] for the equivalence class of A, we can set [A]\cap[B] to be [A\cap B], [A]\cup [B] to be [A\cup B], [A]^c to be [A^c], etc. It is easy to check that these operations are well-defined.

    What about the subset relation? That too has an obvious definition. We don’t want to say that [A]\subset[B] if A\subset B, since that is not well-defined. However, we can define A to be almost contained in B if the set A\setminus B is finite, and then say that [A]\subset[B] if A is almost contained in B. This is well-defined and it’s also easy to check that it is true if and only if [A]\cap[B]=[A], which is the sort of thing we’d like to happen if our finite-fuzz set theory is to resemble normal set theory as closely as possible.

    I will use a non-standard piece of terminology and refer to an equivalence class of sets as an f-set, the “f” standing for “finite” or “fuzzy” (though these fuzzy sets are not to be confused with the usual definition of fuzzy sets, which I don’t know and probably never will know). I’ll also say things like “is f-contained in” (which means the same as “is almost contained in” except that it refers to the f-sets rather than to representatives of their equivalence classes).

    So far so good, but things start to get a bit less satisfactory when we consider infinite intersections and unions. How are we to define \bigcap_{n=1}^\infty[A_n], for example?

    An obvious property we would like is that the intersection should be the largest f-set that is contained in all the [A_n]. However, simple examples show that there doesn’t have to be a largest f-set contained in all the [A_n]. Indeed, let A_1\supset A_2\supset\dots be an infinite sequence of subsets of \mathbb N such that A_n\setminus A_{n+1} is infinite for every n. Then A is almost contained in every A_n if and only if A\setminus A_n is finite for every n. Given any such set, we can find for each n an element b_n of A_n\setminus A_{n+1} that is not contained in A (since A_n\setminus A_{n+1} is infinite but A\setminus A_{n+1} is finite). Then the set B=A\cup\{b_1,b_2,\dots\} is also almost contained in every A_n, and [A] is properly contained in [B] (in the obvious sense).

    OK, we don’t seem to have a satisfactory definition of infinite intersections, but we could at least hope for a satisfactory definition of “has an empty intersection”. And indeed, there is an obvious one. Given a collection of f-sets [A_\gamma], we say that its intersection is empty if the only f-set that is f-contained in every [A_\gamma] is [\emptyset]. (Note that [\emptyset] is the equivalence class of the empty set, which consists of all finite subsets of \mathbb N.) In terms of the sets rather than their equivalence classes, this is saying that there is no infinite set that is almost contained in every A_\gamma.

    An important concept that appears in many places in mathematics, but particularly in set theory, is the finite-intersection property. A collection \mathcal A of subsets of a set X is said to have this property if A_1\cap\dots\cap A_n is non-empty whenever A_1,\dots,A_n\in\mathcal A. This definition carries over to f-sets with no problem at all, since finite f-intersections were easy to define.

    Let’s ask ourselves a little question here: can we find a collection of f-sets with the finite-intersection property but with an empty intersection? That is, no finite intersection is empty, but the intersection of all the f-sets is empty.

    That should be pretty easy. For sets, there are very simple examples like A_n=\{n,n+1,\dots\} — finitely many of those have a non-empty intersection, but there is no set that’s contained in all of them.

    Unfortunately, all those sets are the same if we turn them into f-sets. But there is an obvious way of adjusting the example: we just take sets A_1\supset A_2\supset\dots such that A_n\setminus A_{n+1} is infinite for each n and \bigcap_{n=1}^\infty A_n=\emptyset. That ought to do the job once we turn each A_n into its equivalence class [A_n].

    Except that it doesn’t do the job. In fact, we’ve already observed that we can just pick a set B=\{b_1,b_2,\dots\} with b_n\in A_n\setminus A_{n+1} and then [B] will be a non-empty f-intersection of the A_n.

    However, here’s an example that does work. We’ll take all f-sets [A] such that A has density 1. (This means that n^{-1}|A\cap\{1,2,\dots,n\}| tends
    to 1 as n tends to infinity.) Since the intersection of any two sets of density 1 has density 1 (a simple exercise), this collection of f-sets has the finite-intersection property. I claim that any f-set contained in all these f-sets must be [\emptyset].

    Indeed, let B be an infinite set and (b_1,b_2,\dots) the enumeration of its elements in increasing order. We can pick a subsequence (c_1,c_2,\dots) such that c_n\geq 2^n for every n, and the corresponding subset C is an infinite subset of B with density zero. Therefore, \mathbb N\setminus C is a set of density 1 that does not almost contain B.

    The number of f-sets we took there in order to achieve an f-empty intersection was huge: the cardinality of the continuum. (That’s another easy exercise.) Did we really need that many? This innocent question leads straight to a definition that is needed in order to understand what Malliaris and Shelah did.

    Definition. The cardinal p is the smallest cardinality of a collection F of f-sets such that F has the finite-intersection property but also F has an empty f-intersection.

    It is simple to prove that this cardinal is uncountable, but it is also known not to be as big as the cardinality of the continuum (where again this means that there are models of set theory — necessarily ones where CH fails — for which it is strictly smaller). So it is a rather nice intermediate cardinal, which partially explains its interest to set theorists.

    The cardinal p is one of the two infinities that Malliaris and Shelah proved were the same. The other one is closely related. Define a tower to be a collection of f-sets that does not contain [\emptyset] and is totally ordered by inclusion. Note that a tower T trivially satisfies the finite-intersection property: if [A_1],\dots,[A_n] belong to T, then the smallest of the f-sets [A_i] is the f-intersection and it isn’t f-empty. So let’s make another definition.

    Definition. The cardinal t is the smallest cardinality of a tower T that has an empty f-intersection.

    Since a tower has the finite-intersection property, we are asking for something strictly stronger before, so strictly harder to obtain. It follows that t is at least as large as p.

    And now we have the obvious question: is the inequality strict? As I have said, it was widely believed that it was, and a big surprise when Malliaris and Shelah proved that the two infinities were in fact equal.

    What does this actually say? It says that if you can find a bunch F of f-sets with the finite-intersection property and an empty f-intersection, then you can find a totally ordered example T that has at most the cardinality of F.

    Why is the problem hard?

    I don’t have a sophisticated answer to this that would explain why it is hard to experts in set theory. I just want to think about why it might be hard to prove the statement using a naive approach.

    An immediate indication that things might be difficult is that it isn’t terribly easy to give any example of a tower with an empty f-intersection, let alone one with small cardinality.

    An indication of the problem we face was already present when I gave a failed attempt to construct a system of sets with the finite-intersection property and empty intersection. I took a nested sequence [A_1]\supset[A_2]\supset such that the sets A_n had empty intersection, but that didn’t work because I could pick an element from each A_n\setminus A_{n+1} and put those together to make a non-empty f-intersection. (I’m using “f-intersection” to mean any f-set f-contained in all the given f-sets. In general, we can’t choose a largest one, so it’s far from unique. The usual terminology would be to say that if A is almost contained in every set from a collection of sets, then A is a pseudointersection of that collection. But I’m trying to express as much as possible in terms of f-sets.)

    Anyone who is familiar with ordinal hierarchies will see that there is an obvious thing we could do here. We could start as above, and then when we find the annoying f-intersection we simply add it to the tower and call it [A_\omega]. And then inside [A_\omega] we can find another nested decreasing sequence of sets and call those [A_{\omega+1}], [A_{\omega+2}],\dots and so on. Those will also have a non-empty f-intersection, which we could call [A_{2\omega}], and so on.

    Let’s use this idea to prove that there do exist towers with empty f-intersections. I shall build a collection of non-empty f-sets [A_\alpha] by transfinite induction. If I have already built [A_\alpha], I let [A_{\alpha+1}] be any non-empty f-set that is strictly f-contained in [A_\alpha]. That tells me how to build my sets at successor ordinals. If \alpha is a limit ordinal, then I’ll take A_\alpha to be a non-empty f-intersection of all the [A_\beta] with \beta<\alpha.

    But how am I so sure that such an f-intersection exists? I’m not, but if it doesn’t exist, then I’m very happy, as that means that the f-sets [A_\beta] with \beta<\alpha form a tower with empty f-intersection.

    Since all the f-sets in this tower are distinct, the process has to terminate at some point, and that implies that a tower with empty f-intersection must exist.

    For a lot of ordinal constructions like this, one can show that the process terminates at the first uncountable ordinal, \omega_1. To set theorists, this has extremely small cardinality — by definition, the smallest one after the cardinality of the natural numbers. In some models of set theory, there will be a dizzying array of cardinals between this and the cardinality of the continuum.

    In our case it is not too hard to prove that the process doesn’t terminate before we get to the first uncountable ordinal. Indeed, if \alpha is a countable limit ordinal, then we can take an increasing sequence of ordinals \alpha_n that tend to \alpha, pick an element b_n from A_{\alpha_n}\setminus A_{\alpha_{n+1}}, and define A_\alpha to be \{b_1,b_2,\dots\}.

    However, there doesn’t seem to be any obvious argument to say that the f-sets [A_\alpha] with \alpha<\omega_1 have an empty f-intersection, even if we make some effort to keep our sets small (for example, by defining A_{\alpha+1} to consist of every other element of A_\alpha). In fact, we sort of know that there won’t be such an argument, because if there were, then it would show that there was a tower whose cardinality was that of the first uncountable ordinal. That would prove that t had this cardinality, and since p is uncountable (that is easy to check) we would immediately know that p and t were equal.

    So that’s already an indication that something subtle is going on that you need to be a proper set theorist to understand properly.

    But do we need to understand these funny cardinalities to solve the problem? We don’t need to know what they are — just to prove that they are the same. Perhaps that can still be done in a naive way.

    So here’s a very naive idea. Let’s take a set F of f-sets with the finite intersection property and empty f-intersection, and let’s try to build a tower T with empty intersection using only sets from F. This would certainly be sufficient for showing that T has cardinality at most that of F, and if F has minimal cardinality it would show that p=t.

    There’s almost no chance that this will work, but let’s at least see where it goes wrong, or runs into a brick wall.

    At first things go swimmingly. Let [A]\in F. Then there must exist an f-set [A']'\in F that does not f-contain [A], since otherwise [A] itself would be a non-empty f-intersection for F. But then [A]\cap [A'] is a proper f-subset of [A], and by the finite-intersection property it is not f-empty.

    By iterating this argument, we can therefore obtain a nested sequence [A_1]\supset[A_2]\supset of f-sets in F.

    The next thing we’d like to do is create [A_\omega]. And this, unsurprisingly, is where the brick wall is. Consider, for example, the case where F consists of all sets of density 1. What if we stupidly chose A_n in such a way that \min A_n\geq 2^n for every n? Then our diagonal procedure — picking an element from each set A_n\setminus A_{n+1} — would yield a set of density zero. Of course, we could go for a different diagonal procedure. We would need to prove that for this particular F and any nested sequence we can always find an f-intersection that belongs to F. That’s equivalent to saying that for any sequence A_1\supset A_2\supset of dense sets we can find a set A such that A\setminus A_n is finite for every n and A has density 1.

    That’s a fairly simple (but not trivial) exercise I think, but when I tried to write a proof straight down I failed — it’s more like a pen-and-paper job until you get the construction right. But here’s the real question I’d like to know the answer to right at this moment. It splits into two questions actually.

    Question 1. Let F be a collection of f-sets with the finite-intersection property and no non-empty f-intersection. Let [A_1]\supset[A_2]\supset\dots be a nested sequence of elements of F. Must this sequence have an f-intersection that belongs to F?

    Question 2. If, as seems likely, the answer to Question 1 is no, must it at least be the case that there exists a nested sequence in F with an f-intersection that also belongs to F?

    If the answer to Question 2 turned out to be yes, it would naturally lead to the following further question.

    Question 3. If the answer to Question 2 is yes, then how far can we go with it? For example, must F contain a nested transfinite sequence of uncountable length?

    Unfortunately, even a positive answer to Question 3 would not be enough for us, for reasons I’ve already given. It might be the case that we can indeed build nice big towers in F, but that the arguments stop working once we reach the first uncountable ordinal. Indeed, it might well be known that there are sets F with the finite-intersection property and no non-empty f-intersection that do not contain towers that are bigger than this. If that’s the case, it would give at least one serious reason for the problem being hard. It would tell us that we can’t prove the equality by just finding a suitable tower inside F: instead, we’d need to do something more indirect, constructing a tower T and some non-obvious injection from T to F. (It would be non-obvious because it would not preserve the subset relation.)

    Another way the problem might be difficult is if F does contain a tower with no non-empty f-intersection, but we can’t extend an arbitrary tower in F to a tower with this property. Perhaps if we started off building our tower the wrong way, it would lead us down a path that had a dead end long before the tower was big enough, even though good paths and good towers did exist.

    But these are just pure speculations on my part. I’m sure the answers to many of my questions are known. If so, I’ll be interested to hear about it, and to understand better why Malliaris and Shelah had to use big tools and a much less obvious argument than the kind of thing I was trying to do above.

    http://gowers.wordpress.com/?p=6359
    Extensions
    Intransitive dice VII — aiming for further results
    polymath13
    While Polymath13 has (barring a mistake that we have not noticed) led to an interesting and clearly publishable result, there are some obvious follow-up questions that we would be wrong not to try to answer before finishing the project, especially as some of them seem to be either essentially solved or promisingly close to a […]
    Show full content

    While Polymath13 has (barring a mistake that we have not noticed) led to an interesting and clearly publishable result, there are some obvious follow-up questions that we would be wrong not to try to answer before finishing the project, especially as some of them seem to be either essentially solved or promisingly close to a solution. The ones I myself have focused on are the following.

    1. Is it true that if two random elements A and B of [n]^n are chosen, then A beats B with very high probability if it has a sum that is significantly larger? (Here “significantly larger” should mean larger by f(n) for some function f(n)=o(n^{3/2}) — note that the standard deviation of the sum has order n^{3/2}, so the idea is that this condition should be satisfied one way or the other with probability 1-o(1)).
    2. Is it true that the stronger conjecture, which is equivalent (given what we now know) to the statement that for almost all pairs (A,B) of random dice, the event that A beats a random die C has almost no correlation with the event that B beats C, is false?
    3. Can the proof of the result obtained so far be modified to show a similar result for the multisets model?

    The status of these three questions, as I see it, is that the first is basically solved — I shall try to justify this claim later in the post, for the second there is a promising approach that will I think lead to a solution — again I shall try to back up this assertion, and while the third feels as though it shouldn’t be impossibly difficult, we have so far made very little progress on it, apart from experimental evidence that suggests that all the results should be similar to those for the balanced sequences model. [Added after finishing the post: I may possibly have made significant progress on the third question as a result of writing this post, but I haven’t checked carefully.]

    The strength of a die depends strongly on the sum of its faces.

    Let A=(a_1,\dots,a_n) and B=(b_1,\dots,b_n) be elements of [n]^n chosen uniformly and independently at random. I shall now show that the average of

    |\{(i,j):a_i>b_j\}|-|\{(i,j):a_i<b_j\}|-\sum_ia_i+\sum_jb_j

    is zero, and that the probability that this quantity differs from its average by substantially more than n\log n is very small. Since typically the modulus of \sum_ia_i-\sum_jb_j has order n^{3/2}, it follows that whether or not A beats B is almost always determined by which has the bigger sum.

    As in the proof of the main theorem, it is convenient to define the functions

    f_A(j)=|\{i:a_i<j\}|+\frac 12|\{i:a_i=j\}|

    and

    g_A(j)=f_A(j)-j+\frac 12.

    Then

    \sum_jf_A(b_j)=\sum_{i,j}\mathbf 1_{a_i<b_j}+\frac 12\sum_{i,j}\mathbf 1_{a_i=b_j},

    from which it follows that B beats A if and only if \sum_jf_A(b_j)>n^2/2. Note also that

    \sum_jg_A(b_j)=\sum_jf_A(b_j)-\sum_jb_j+\frac n2.

    If we choose A purely at random from [n]^n, then the expectation of f_A(j) is j-1/2, and Chernoff’s bounds imply that the probability that there exists j with |g_A(j)|=|f_A(j)-j+1/2|\geq C\sqrt{n\log n} is, for suitable C at most n^{-10}. Let us now fix some A for which there is no such j, but keep B as a purely random element of [n]^n.

    Then \sum_jg_A(b_j) is a sum of n independent random variables, each with maximum at most C\sqrt{n\log n}. The expectation of this sum is \sum_jg_A(j)=\sum_jf_A(j)-n^2/2.

    But

    \sum_jf_A(j)=\sum_{i,j}\mathbf 1_{a_i<j}+\frac 12\sum_{i,j}\mathbf 1_{a_i=j}

    =\sum_i(n-a_i)+\frac n2=n^2+\frac n2-\sum_ia_i,

    so the expectation of \sum_jg_A(b_j) is n(n+1)/2-\sum_ia_i.

    By standard probabilistic estimates for sums of independent random variables, with probability at least 1-n^{-10} the difference between \sum_jg_A(b_j) and its expectation \sum_jf_A(j)-n^2/2 is at most Cn\log n. Writing this out, we have

    |\sum_jf_A(b_j)-\sum_jb_j+\frac n2-n(n+1)/2+\sum_ia_i|\leq Cn\log n,

    which works out as

    |\sum_jf_A(b_j)-\frac {n^2}2-\sum_jb_j+\sum_ia_i|\leq Cn\log n.

    Therefore, if \sum_ia_i>\sum_jb_j+Cn\log n, it follows that with high probability \sum_jf_A(b_j)<n^2/2, which implies that A beats B, and if \sum_jb_j>\sum_ia_i+Cn\log n, then with high probability B beats A. But one or other of these two cases almost always happens, since the standard deviations of \sum_ia_i and \sum_jb_j are of order n^{3/2}. So almost always the die that wins is the one with the bigger sum, as claimed. And since “has a bigger sum than” is a transitive relation, we get transitivity almost all the time.

    Why the strong conjecture looks false

    As I mentioned, the experimental evidence seems to suggest that the strong conjecture is false. But there is also the outline of an argument that points in the same direction. I’m going to be very sketchy about it, and I don’t expect all the details to be straightforward. (In particular, it looks to me as though the argument will be harder than the argument in the previous section.)

    The basic idea comes from a comment of Thomas Budzinski. It is to base a proof on the following structure.

    1. With probability bounded away from zero, two random dice A and B are “close”.
    2. If A and B are two fixed dice that are close to each other and C is random, then the events “A beats C” and “B beats C” are positively correlated.

    Here is how I would imagine going about defining “close”. First of all, note that the function g_A is somewhat like a random walk that is constrained to start and end at zero. There are results that show that random walks have a positive probability of never deviating very far from the origin — at most half a standard deviation, say — so something like the following idea for proving the first step (remaining agnostic for the time being about the precise definition of “close”). We choose some fixed positive integer k and let x_1<\dots<x_k be integers evenly spread through the interval \{1,2,\dots,n\}. Then we argue — and this should be very straightforward — that with probability bounded away from zero, the values of f_A(x_i) and f_B(x_i) are close to each other, where here I mean that the difference is at most some small (but fixed) fraction of a standard deviation.

    If that holds, it should also be the case, since the intervals between x_{i-1} and x_i are short, that f_A and f_B are uniformly close with positive probability.

    I’m not quite sure whether proving the second part would require the local central limit theorem in the paper or whether it would be an easier argument that could just use the fact that since f_A and f_B are close, the sums \sum_jf_A(c_j) and \sum_jf_B(c_j) are almost certainly close too. Thomas Budzinski sketches an argument of the first kind, and my guess is that that is indeed needed. But either way, I think it ought to be possible to prove something like this.

    What about the multisets model?

    We haven’t thought about this too hard, but there is a very general approach that looks to me promising. However, it depends on something happening that should be either quite easy to establish or not true, and at the moment I haven’t worked out which, and as far as I know neither has anyone else.

    The difficulty is that while we still know in the multisets model that A beats B if and only if \sum_jf_A(b_j)<n^2/2 (since this depends just on the dice and not on the model that is used to generate them randomly), it is less easy to get traction on the sum because it isn’t obvious how to express it as a sum of independent random variables.

    Of course, we had that difficulty with the balanced-sequences model too, but there we got round the problem by considering purely random sequences B and conditioning on their sum, having established that certain events held with sufficiently high probability for the conditioning not to stop them holding with high probability.

    But with the multisets model, there isn’t an obvious way to obtain the distribution over random dice B by choosing b_1,\dots,b_n independently (according to some distribution) and conditioning on some suitable event. (A quick thought here is that it would be enough if we could approximate the distribution of B in such a way, provided the approximation was good enough. The obvious distribution to take on each b_i is the marginal distribution of that b_i in the multisets model, and the obvious conditioning would then be on the sum, but it is far from clear to me whether that works.)

    A somewhat different approach that I have not got far with myself is to use the standard one-to-one correspondence between increasing sequences of length n taken from [n] and subsets of [2n-1] of size n. (Given such a sequence (a_1,\dots,a_n) one takes the subset \{a_1,a_2+1,\dots,a_n+n-1\}, and given a subset S=\{s_1,\dots,s_n\}\subset[2n-1], where the s_i are written in increasing order, one takes the multiset of all values s_i-i+1, with multiplicity.) Somehow a subset of [2n-1] of size n feels closer to a bunch of independent random variables. For example, we could model it by choosing each element with probability n/(2n-1) and conditioning on the number of elements being exactly n, which will happen with non-tiny probability.

    Actually, now that I’m writing this, I’m coming to think that I may have accidentally got closer to a solution. The reason is that earlier I was using a holes-and-pegs approach to defining the bijection between multisets and subsets, whereas with this approach, which I had wrongly assumed was essentially the same, there is a nice correspondence between the elements of the multiset and the elements of the set. So I suddenly feel more optimistic that the approach for balanced sequences can be adapted to the multisets model.

    I’ll end this post on that optimistic note: no doubt it won’t be long before I run up against some harsh reality.

    http://gowers.wordpress.com/?p=6345
    Extensions
    Another journal flips
    ElsevierMathematics on the internetNews
    There is widespread (even if not universal) agreement that something is deeply wrong with the current system of academic publishing. The basic point, which has been made innumerable times by innumerable people, is that the really hard parts — the writing of papers, and the peer review and selection of the ones to publish — […]
    Show full content

    There is widespread (even if not universal) agreement that something is deeply wrong with the current system of academic publishing. The basic point, which has been made innumerable times by innumerable people, is that the really hard parts — the writing of papers, and the peer review and selection of the ones to publish — are done voluntarily by academics, and modern technology makes things like typesetting and dissemination extremely cheap. And yet publishers are making more money than ever before. They do this by insisting that we give them ownership of the content we produce (though many journals will publish papers even if you strike out the part of the contract that hands them this ownership — these days I never agree to give copyright to a publisher, and I urge you not to either), and by bundling their journals together so that libraries are forced into an all-or-nothing decision. (As another aside, I also urge libraries to look closely at what is happening in Germany, where they have gone for the “nothing” option with Elsevier and the world has not come to an end.)

    What can be done about this? There are many actions, none of which are likely to be sufficient to bring about major change on their own, but which in combination will help to get us to a tipping point. In no particular order, here are some of them.

    1. Create new journals that operate much more cheaply and wait for them to become established.
    2. Persuade libraries not to agree to Big Deals with the big publishers.
    3. Refuse to publish with, write for, or edit for, the big publishers.
    4. Make sure all your work is freely available online.
    5. Encourage journals that are supporting the big publishers to leave those publishers and set up in a cheaper and fairer way.

    Not all of these are easy things to do, but I’m delighted to report that a small group I belong to, set up by Mark Wilson, has, after approaching a large number of maths journals, found one that was ready to “flip”: the Journal of Algebraic Combinatorics has just announced that it will be leaving Springer. Or if you want to be more pedantic about it, a new journal will be starting, called Algebraic Combinatorics and published by The Mersenne Centre for Open Scientific Publishing, and almost all the editors of the Journal of Algebraic Combinatorics will resign from that journal and become editors of the new one, which will adhere to Fair Open Access Principles.

    If you want to see change, then you should from now on regard Algebraic Combinatorics as the true continuation of the Journal of Algebraic Combinatorics, and the Journal of Algebraic Combinatorics as a zombie journal that happens to have a name that coincides with a former real journal. And of course, that means that if you are an algebraic combinatorialist with a paper that would have been suitable for the Journal of Algebraic Combinatorics, you should understand that the reputation of the Journal of Algebraic Combinatorics is being transferred, along with the editorial board, to Algebraic Combinatorics, and you should therefore submit it to Algebraic Combinatorics. This has worked with previous flips: the zombie journal rarely thrives afterwards and in some notable cases has ceased to publish after a couple of years or so.

    The words of one of the editors of the Journal of Algebraic Combinatorics, Hugh Thomas, are particularly telling, especially the first sentence: “There wasn’t a particular crisis. It has been becoming more and more clear that commercial journal publishers are charging high subscription fees and high Article Processing Charges (APCs), profiting from the volunteer labour of the academic community, and adding little value. It is getting easier and easier to automate the things that they once took care of. The actual printing and distribution of paper copies is also much less important than it has been in the past; this is something which we have decided we can do without.”

    I mentioned earlier that we approached many journals. Although it is very exciting that one journal is flipping, I must also admit to disappointment at how low our strike rate has been so far. However, the words “so far” are important: many members of editorial boards were very sympathetic with our aims, and some journals were adopting a wait-and-see attitude, so if the flip of JACo is successful, we hope that it will encourage other journals. I should say that we weren’t just saying, “Why don’t you flip?” but we were also offering support, including financial support. The current situation is that we can almost certainly finance journals that are ready to flip to an “ultra-cheap” model (using a platform that charges either nothing or a very small fee per submission) and help with administrative support, and are working on financial support for more expensive models, but still far cheaper than the commercial publishers, where more elaborate services are offered.

    Understandably, the main editors tended to be a lot more cautious on average than the bulk of the editorial boards. I think many of them were worried that they might accidentally destroy their journals if they flipped them, and in the case of journals with long traditions, this is not something one would want to be remembered for. So again, the more we can support Algebraic Combinatorics, the more likely it is that this caution will be reduced and other journals will consider following. (If you are an editor of a journal we have not approached, please do get in touch to discuss what the possibilities are — we have put a lot of thought into it.)

    Another argument put forward by some editors is that to flip a journal risks damaging the reputation of the old version of the journal, and therefore, indirectly, the reputation of the papers published in it, some of which are by early-career researchers. So they did not want to flip in order to avoid damaging the careers of young mathematicians. If you are a young mathematician and would like to comment on whether you would be bothered by a journal flipping after you had published in it, we would be very interested to hear what you have to say.

    Against that background I’d like to congratulate the editors of the Journal of Algebraic Combinatorics for their courage and for the work they have put into this. (But that word “work” should not put off other editors: one of the aims of our small group was to provide support and expertise, including from Johann Rooryck, the editor of the Elsevier journal Lingua, which flipped to become Glossa, in order to make the transition as easy as possible.) I’d also like to make clear, to avoid any misunderstanding that might arise, that although I’ve been involved in a lot of discussion with Mark Wilson’s group and wrote to many editors of other journals, my role in this particular flip has been a minor one.

    And finally, let me repeat the main message of this post: please support the newly flipped journal, since the more successful it is, the greater the chance that other journals will follow, and the greater the chance that we will be able to move to a more sensible academic publishing system.

    http://gowers.wordpress.com/?p=6336
    Extensions
    Intransitive dice VI: sketch proof of the main conjecture for the balanced-sequences model
    polymath13
    I have now completed a draft of a write-up of a proof of the following statement. Recall that a random -sided die (in the balanced-sequences model) is a sequence of length of integers between 1 and that add up to , chosen uniformly from all such sequences. A die beats a die if the number […]
    Show full content

    I have now completed a draft of a write-up of a proof of the following statement. Recall that a random n-sided die (in the balanced-sequences model) is a sequence of length n of integers between 1 and n that add up to n(n+1)/2, chosen uniformly from all such sequences. A die (a_1,\dots,a_n) beats a die (b_1,\dots,b_n) if the number of pairs (i,j) such that a_i>b_j exceeds the number of pairs (i,j) such that a_i<b_j. If the two numbers are the same, we say that A ties with B.

    Theorem. Let A,B and C be random n-sided dice. Then the probability that A beats C given that A beats B and B beats C is \frac 12+o(1).

    In this post I want to give a fairly detailed sketch of the proof, which will I hope make it clearer what is going on in the write-up.

    The first step is to show that the theorem is equivalent to the following statement.

    Theorem. Let A be a random n-sided die. Then with probability 1-o(1), the proportion of n-sided dice that A beats is \frac 12+o(1).

    We had two proofs of this statement in earlier posts and comments on this blog. In the write-up I have used a very nice short proof supplied by Luke Pebody. There is no need to repeat it here, since there isn’t much to say that will make it any easier to understand than it already is. I will, however, mention once again an example that illustrates quite well what this statement does and doesn’t say. The example is of a tournament (that is, complete graph where every edge is given a direction) where every vertex beats half the other vertices (meaning that half the edges at the vertex go in and half go out) but the tournament does not look at all random. One just takes an odd integer n and puts arrows out from x to x+y mod n for every y\in\{1,2,\dots,(n-1)/2\}, and arrows into x for every y\in\{(n+1)/2,\dots,n-1\}. It is not hard to check that the probability that there is an arrow from x to z given that there are arrows from x to y and y to z is approximately 1/2, and this turns out to be a general phenomenon.

    So how do we prove that almost all n-sided dice beat approximately half the other n-sided dice?

    The first step is to recast the problem as one about sums of independent random variables. Let [n] stand for \{1,2,\dots,n\} as usual. Given a sequence A=(a_1,\dots,a_n)\in[n]^n we define a function f_A:[n]\to[n] by setting f_A(j) to be the number of i such that a_i<j plus half the number of i such that a_i=j. We also define g_A(j) to be f_A(j)-(j-1/2). It is not hard to verify that A beats B if \sum_jg_A(b_j)<0, ties with B if \sum_jg_A(b_j)=0, and loses to B if \sum_jg_A(b_j)>0.

    So our question now becomes the following. Suppose we choose a random sequence (b_1,\dots,b_n) with the property that \sum_jb_j=n(n+1)/2. What is the probability that \sum_jg_A(b_j)>0? (Of course, the answer depends on A, and most of the work of the proof comes in showing that a “typical” A has properties that ensure that the probability is about 1/2.)

    It is convenient to rephrase the problem slightly, replacing b_j by b_j-(n+1)/2. We can then ask it as follows. Suppose we choose a sequence (V_1,\dots,V_n) of n elements of the set \{-(n-1)/2,-(n-1)/2+1,\dots,(n-1)/2\}, where the terms of the sequence are independent and uniformly distributed. For each j let U_j=g_A(V_j). What is the probability that \sum_jU_j>0 given that \sum_jV_j=0?

    This is a question about the distribution of \sum_j(U_j,V_j), where the (U_j,V_j) are i.i.d. random variables taking values in \mathbb Z^2 (at least if n is odd — a small modification is needed if n is even). Everything we know about probability would lead us to expect that this distribution is approximately Gaussian, and since it has mean (0,0), it ought to be the case that if we sum up the probabilities that \sum_j(U_j,V_j)=(x,0) over positive x, we should get roughly the same as if we sum them up over negative x. Also, it is highly plausible that the probability of getting (0,0) will be a lot smaller than either of these two sums.

    So there we have a heuristic argument for why the second theorem, and hence the first, ought to be true.

    There are several theorems in the literature that initially seemed as though they should be helpful. And indeed they were helpful, but we were unable to apply them directly, and had instead to develop our own modifications of their proofs.

    The obvious theorem to mention is the central limit theorem. But this is not strong enough for two reasons. The first is that it tells you about the probability that a sum of random variables will lie in some rectangular region of \mathbb R^2 of size comparable to the standard deviation. It will not tell you the probability of belonging to some subset of the y-axis (even for discrete random variables). Another problem is that the central limit on its own does not give information about the rate of convergence to a Gaussian, whereas here we require one.

    The second problem is dealt with for many applications by the Berry-Esseen theorem, but not the first.

    The first problem is dealt with for many applications by local central limit theorems, about which Terence Tao has blogged in the past. These tell you not just about the probability of landing in a region, but about the probability of actually equalling some given value, with estimates that are precise enough to give, in many situations, the kind of information that we seek here.

    What we did not find, however, was precisely the theorem we were looking for: a statement that would be local and 2-dimensional and would give information about the rate of convergence that was sufficiently strong that we would be able to obtain good enough convergence after only n steps. (I use the word “step” here because we can think of a sum of n independent copies of a 2D random variable as an n-step random walk.) It was not even clear in advance what such a theorem should say, since we did not know what properties we would be able to prove about the random variables (U_i,V_i) when A was “typical”. That is, we knew that not every A worked, so the structure of the proof (probably) had to be as follows.

    1. Prove that A has certain properties with probability 1-o(1).

    2. Using these properties, deduce that the sum \sum_{i=1}^n(U_i,V_i) converges very well after n steps to a Gaussian.

    3. Conclude that the heuristic argument is indeed correct.

    The key properties that A needed to have were the following two. First, there needed to be a bound on the higher moments of U. This we achieved in a slightly wasteful way — but the cost was a log factor that we could afford — by arguing that with high probability no value of g_A(j) has magnitude greater than 6\sqrt{n\log n}. To prove this the steps were as follows.

    1. Let A be a random element of [n]^n. Then the probability that there exists j with g_A(j)\geq 6\sqrt{n\log n} is at most n^{-k} (for some k such as 10).
    2. The probability that \sum_ia_i=n(n+1)/2 is at least cn^{-3/2} for some absolute constant c>0.
    3. It follows that if A is a random n-sided die, then with probability 1-o(1) we have |g_A(j)|\leq 6\sqrt{n\log n} for every j.

    The proofs of the first two statements are standard probabilistic estimates about sums of independent random variables.

    The second property that A needed to have is more difficult to obtain. There is a standard Fourier-analytic approach to proving central limit theorems, and in order to get good convergence it turns out that what one wants is for a certain Fourier transform to be sufficiently well bounded away from 1. More precisely, we define the characteristic function of the random variable (U,V) to be

    \hat f(\alpha,\beta)=\mathbb E e(\alpha U+\beta V)=\sum_{x,y}f(x,y)e(\alpha x+\beta y),

    where e(x) is shorthand for \exp(2\pi ix), f(x,y)=\mathbb P[(U,V)=(x,y)], and \alpha and \beta range over \mathbb T=\mathbb R/\mathbb Z.

    I’ll come later to why it is good for \hat f(\alpha,\beta) not to be too close to 1. But for now I want to concentrate on how one proves a statement like this, since that is perhaps the least standard part of the argument.

    To get an idea, let us first think what it would take for |\hat f(\alpha,\beta)| to be very close to 1. This condition basically tells us that \alpha U+\beta V is highly concentrated mod 1: indeed, if \alpha U+\beta V is highly concentrated, then e(\alpha U+\beta V) takes approximately the same value almost all the time, so the average is roughly equal to that value, which has modulus 1; conversely, if \alpha U+\beta V is not highly concentrated mod 1, then there is plenty of cancellation between the different values of e(\alpha U+\beta V) and the result is that the average has modulus appreciably smaller than 1.

    So the task is to prove that the values of \alpha U+\beta V are reasonably well spread about mod 1. Note that this is saying that the values of \alpha g_A(j)+\beta(j-(n+1)/2) are reasonably spread about.

    The way we prove this is roughly as follows. Let \alpha>0, let m be of order of magnitude \alpha^{-2}, and consider the values of g_A at the four points j, j+m, j+2m and j+3m. Then a typical order of magnitude of g_A(j)-g_A(j+m) is around \sqrt m, and one can prove without too much trouble (here the Berry-Esseen theorem was helpful to keep the proof short) that the probability that

    |g_A(j)-g_A(j+m)-g_A(j+2m)+g_A(3m)|\geq c\sqrt m

    is at least c, for some positive absolute constant c. It follows by Markov’s inequality that with positive probability one has the above inequality for many values of j.

    That’s not quite good enough, since we want a probability that’s very close to 1. This we obtain by chopping up [n] into intervals of length 4m and applying the above argument in each interval. (While writing this I’m coming to think that I could just as easily have gone for progressions of length 3, not that it matters much.) Then in each interval there is a reasonable probability of getting the above inequality to hold many times, from which one can prove that with very high probability it holds many times.

    But since m is of order \alpha^{-2}, \alpha\sqrt m is of order 1, which gives that the values e(g_A(j)), e(g_A(j+m)), e(g_A(j+2m), e(g_A(j+3m)) are far from constant whenever the above inequality holds. So by averaging we end up with a good upper bound for |\hat f(\alpha,\beta)|.

    The alert reader will have noticed that if \alpha\ll n^{-1/2}, then the above argument doesn’t work, because we can’t choose m to be bigger than n. In that case, however, we just do the best we can: we choose m to be of order n/\log n, the logarithmic factor being there because we need to operate in many different intervals in order to get the probability to be high. We will get many quadruples where

    \alpha|g_A(j)-g_A(j+m)-g_A(j+2m)+g_A(3m)|\geq c\alpha\sqrt m=c'\alpha\sqrt{n/\log n},

    and this translates into a lower bound for 1-|\hat f(\alpha,\beta)| of order \alpha^2n/\log n, basically because 1-\cos\theta has order \theta^2 for small \theta. This is a good bound for us as long as we can use it to prove that |\hat f(\alpha,\beta)|^n is bounded above by a large negative power of n. For that we need \alpha^2n/\log n to be at least C\log n/n (since (1-C\log n/n)^n is about \exp(-C\log n)=n^{-C}), so we are in good shape provided that \alpha\gg\log n/n.

    The alert reader will also have noticed that the probabilities for different intervals are not independent: for example, if some f_A(j) is equal to n, then beyond that g_A(j) depends linearly on j. However, except when j is very large, this is extremely unlikely, and it is basically the only thing that can go wrong. To make this rigorous we formulated a concentration inequality that states, roughly speaking, that if you have a bunch of k events, and almost always (that is, always, unless some very unlikely event occurs) the probability that the ith event holds given that all the previous events hold is at least c, then the probability that fewer than ck/2 of the events hold is exponentially small in k. The proof of the concentration inequality is a standard exponential-moment argument, with a small extra step to show that the low-probability events don’t mess things up too much.

    Incidentally, the idea of splitting up the interval in this way came from an answer by Serguei Popov to a Mathoverflow question I asked, when I got slightly stuck trying to prove a lower bound for the second moment of U. I eventually didn’t use that bound, but the interval-splitting idea helped for the bound for the Fourier coefficient as well.

    So in this way we prove that |\hat f(\alpha,\beta)|^n is very small if |\alpha|\gg\log n/n. A simpler argument of a similar flavour shows that |\hat f(\alpha,\beta)|^n is also very small if |\alpha| is smaller than this and |\beta|\gg n^{-3/2}.

    Now let us return to the question of why we might like |\hat f(\alpha,\beta)|^n to be small. It follows from the inversion and convolution formulae in Fourier analysis. The convolution formula tells us that the characteristic function of the sum of the U_i (which are independent and each have characteristic function \hat f) is (\hat f)^n. And then the inversion formula tells us that

    \mathbb P[(\sum_iU_i,\sum_iV_i)=(x,y)]=\int_{(\alpha,\beta)\in\mathbb T^2}\hat f(\alpha,\beta)^ne(-\alpha x-\beta y)\mathop{d\alpha}\mathop{d\beta}

    What we have proved can be used to show that the contribution to the integral on the right-hand side from those pairs (\alpha,\beta) that lie outside a small rectangle (of width n^{-1} in the \alpha direction and n^{-3/2} in the \beta direction, up to log factors) is negligible.

    All the above is true provided the random n-sided die A satisfies two properties (the bound on \|U\|_\infty and the bound on |\hat f(\alpha,\beta)|), which it does with probability 1-o(1).

    We now take a die A with these properties and turn our attention to what happens inside this box. First, it is a standard fact about characteristic functions that their derivatives tell us about moments. Indeed,

    \frac{\partial^{r+s}}{\partial^r\alpha\partial^s\beta}\mathbb E e(\alpha U+\beta V)=(2\pi i)^{r+s}\mathbb E U^rV^s e(\alpha U+\beta V),

    and when \alpha=\beta=0 this is \mathbb E U^rV^s. It therefore follows from the two-dimensional version of Taylor’s theorem that

    \hat f(\alpha,\beta)=1-2\pi^2(\alpha^2\mathbb EU^2+2\alpha\beta\mathbb EUV+\beta^2\mathbb EV^2)

    plus a remainder term R(\alpha,\beta) that can be bounded above by a constant times (|\alpha|\|U\|_\infty+|\beta|\|V\|_\infty)^3.

    Writing Q(\alpha,\beta) for 2\pi^2(\alpha^2\mathbb EU^2+2\alpha\beta\mathbb EUV+\beta^2\mathbb EV^2) we have that Q is a positive semidefinite quadratic form in \alpha and \beta. (In fact, it turns out to be positive definite.) Provided R(\alpha,\beta) is small enough, replacing it by zero does not have much effect on \hat f(\alpha,\beta)^n, and provided Q(\alpha,\beta)^2 is small enough, (1-Q(\alpha,\beta))^n is well approximated by \exp(-Q(\alpha,\beta)).

    It turns out, crucially, that the approximations just described are valid in a box that is much bigger than the box inside which \hat f(\alpha,\beta) has a chance of not being small. That implies that the Gaussian decays quickly (and is why we know that Q is positive definite).

    There is a bit of back-of-envelope calculation needed to check this, but the upshot is that the probability that (\sum_iU_i,\sum_iV_i)=(x,y) is very well approximated, at least when x and y aren’t too big, by a formula of the form

    G(x,y)=\int\exp(-Q(\alpha,\beta))e(-\alpha x-\beta y)\mathop{d\alpha}\mathop{d\beta}.

    But this is the formula for the Fourier transform of a Gaussian (at least if we let \alpha and \beta range over \mathbb R^2, which makes very little difference to the integral because the Gaussian decays so quickly), so it is the restriction to \mathbb Z^2 of a Gaussian, just as we wanted.

    When we sum over infinitely many values of x and y, uniform estimates are not good enough, but we can deal with that very directly by using simple measure concentration estimates to prove that the probability that (\sum_iU_i,\sum_iV_i)=(x,y) is very small outside a not too large box.

    That completes the sketch of the main ideas that go into showing that the heuristic argument is indeed correct.

    Any comments about the current draft would be very welcome, and if anyone feels like working on it directly rather than through me, that is certainly a possibility — just let me know. I will try to post soon on the following questions, since it would be very nice to be able to add answers to them.

    1. Is the more general quasirandomness conjecture false, as the experimental evidence suggests? (It is equivalent to the statement that if A and B are two random n-sided dice, then with probability 1-o(1), the four possibilities for whether another die beats A and whether it beats B each have probability \frac 14+o(1).)

    2. What happens in the multiset model? Can the above method of proof be adapted to this case?

    3. The experimental evidence suggests that transitivity almost always occurs if we pick purely random sequences from [n]^n. Can we prove this rigorously? (I think I basically have a proof of this, by showing that whether or not A beats B almost always depends on whether A has a bigger sum than B. I’ll try to find time reasonably soon to add this to the draft.)

    Of course, other suggestions for follow-up questions will be very welcome, as will ideas about the first two questions above.

    http://gowers.wordpress.com/?p=6318
    Extensions
    Intransitive dice V: we want a local central limit theorem
    polymath13
    It has become clear that what we need in order to finish off one of the problems about intransitive dice is a suitable version of the local central limit theorem. Roughly speaking, we need a version that is two-dimensional — that is, concerning a random walk on — and completely explicit — that is, giving […]
    Show full content

    It has become clear that what we need in order to finish off one of the problems about intransitive dice is a suitable version of the local central limit theorem. Roughly speaking, we need a version that is two-dimensional — that is, concerning a random walk on \mathbb Z^2 — and completely explicit — that is, giving precise bounds for error terms so that we can be sure that they get small fast enough.

    There is a recent paper that does this in the one-dimensional case, though it used an elementary argument, whereas I would prefer to use Fourier analysis. Here I’d like to begin the process of proving a two-dimensional result that is designed with our particular application in mind. If we are successful in doing that, then it would be natural to try to extract from the proof a more general statement, but that is not a priority just yet.

    As people often do, I’ll begin with a heuristic argument, and then I’ll discuss how we might try to sharpen it up to the point where it gives us good bounds for the probabilities of individual points of \mathbb Z^2. Much of this post is cut and pasted from comments on the previous post, since it should be more convenient to have it in one place.

    Using characteristic functions

    The rough idea of the characteristic-functions approach, which I’ll specialize to the 2-dimensional case, is as follows. (Apologies to anyone who knows about this properly for anything idiotic I might accidentally write.) Let (X,Y) be a random variable on \mathbb Z^2 and write f(x,y) for \mathbb P[(X,Y)=(x,y)]. If we take n independent copies of (X,Y) and add them together, then the probability of being at (x,y) is

    f*\dots*f(x,y)

    where that denotes the n-fold convolution.

    Now let’s define the Fourier transform of f, which probabilists call the characteristic function, in the usual way by

    \hat f(\alpha,\beta)=\sum_{x,y}f(x,y)e^{2\pi i(\alpha x+\beta y)}.

    Here \alpha and \beta belong to \mathbb R/\mathbb Z, but I’ll sometimes think of them as belonging to [0,1) too.

    We have the convolution law that \widehat{f*g}=\hat f\hat g and the inversion formula

    f(x,y)=\int\int \hat f(\alpha,\beta)e^{-2\pi i(\alpha x+\beta y)}\,d\alpha\,d\beta.

    Putting these together, we find that if random variables (X_i,Y_i) are n independent copies of (X,Y), then the probability that their sum is (x,y) is

    \int\int \hat f(\alpha,\beta)^n e^{-2\pi i(\alpha x+\beta y)}\,d\alpha\,d\beta.

    The very rough reason that we should now expect a Gaussian formula is that we consider a Taylor expansion of f. We can assume for our application that X and Y have mean zero. From that one can argue that the coefficients of the linear terms in the Taylor expansion are zero. (I’ll give more details in a subsequent comment.) The constant term is 1, and the quadratic terms give us the covariance matrix of X and Y. If we assume that we can approximate f(\alpha,\beta) by an expression of the form 1-q(\alpha,\beta) for some suitable quadratic form in \alpha and \beta, then the nth power should be close to \exp(-nq(\alpha,\beta)), and then, since Fourier transforms (and inverse Fourier transforms) take Gaussians to Gaussians, when we invert this one, we should get a Gaussian-type formula for f(x,y). So far I’m glossing over the point that Gaussians are defined on \mathbb R^2, whereas \alpha and \beta live in \mathbb T and x and y live in \mathbb Z^2, but if most of \hat f^n is supported in a small region around 0, then this turns out not to be too much of a problem.

    The Taylor-expansion part

    If we take the formula

    \hat f(\alpha,\beta)=\sum_{x,y}f(x,y)e^{2\pi i(\alpha x+\beta y)}

    and partially differentiate r times with respect to \alpha and s times with respect to \beta we obtain the expression

    (2\pi i)^{r+s}\sum_{x,y}x^ry^sf(x,y)e^{2\pi i(\alpha x+\beta y)}.

    Setting \alpha=\beta=0 turns this into (2\pi i)^{r+s}\mathbb E(X^rY^s). Also, for every \alpha and \beta the absolute value of the partial derivative is at most (2\pi)^{r+s}\mathbb E(X^rY^s). This allows us to get a very good handle on the Taylor expansion of \hat f when \alpha and \beta are close to the origin.

    Recall that the two-dimensional Taylor expansion of \hat f about (0,0) is given by the formula

    \hat f(\alpha,\beta)=\hat f(0,0)+D_1\hat f(0,0)\alpha+D_2\hat f(0,0)\beta+\frac 12D_{11}\hat f(0,0)\alpha^2
    +D_{12}\hat f(0,0)\alpha\beta+\frac 12 D_{22}\hat f(0,0)\beta^2+\mbox{error term}

    where D_1 is the partial derivative operator with respect to the first coordinate, D_{12} the mixed partial derivative, and so on.

    In our case, \hat f(0,0)=\sum_{x,y}f(x,y)=1, D_1\hat f(0,0)=2\pi i\mathbb EX=0, and D_2\hat f(0,0)=2\pi i\mathbb EY=0.

    As in the one-dimensional case, the error term has an integral representation, namely

    \sum_{j=0}^3\binom 3j  \alpha^j\beta^{3-j}\int_0^1 (1-t)^2 (D_1^jD_2^{3-j}\hat f)(t\alpha,t\beta)\,dt,

    which has absolute value at most 8\pi^3\sum_{j=1}^3\binom 3j|\alpha^j\beta^{3-j}\mathbb E X^jY^{3-j}|, which in turn is at most

    8\pi^3\sum_{j=1}^3\binom 3j|\alpha|^j|\beta|^{3-j}\|X\|_\infty^j\|Y\|_\infty^{3-j}=(|\alpha|\|X\|_\infty+|\beta|\|Y\|_\infty)^3.

    When (X,Y) is the random variable (h_A(j),j-(n+1)/2) (where A is a fixed die and j is chosen randomly from [n]), we have that \|Y\|_\infty=(n-1)/2<n.

    With very slightly more effort we can get bounds for the moments of h_A as well. For any particular j and a purely random sequence A=(a_1,\dots,a_n)\in [n]^n, the probability that |h_A(j)|\geq t\sqrt n is bounded above by e^{-ct^2} for an absolute constant c>0. (Something like 1/8 will do.) So the probability that there exists such a j conditional on \sum_ia_i=n(n+1)/2 (which happens with probability about 1/n) is at most Cn^2e^{-ct^2}, and in particular is small when t=100\sqrt{\log n}. I think that with a bit more effort we could probably prove that \mathbb E_j |h_A(j)|^3 is at most Cn^{3/2}, which would allow us to improve the bound for the error term, but I think we can afford the logarithmic factor here, so I won’t worry about this. So we get an error of C(|\alpha|\sqrt{n\log n}+\beta n)^3.

    For this error to count as small, we want it to be small compared with the second moments. For the time being I’m just going to assume that the rough size of the second-moment contribution is around c(|\alpha|\sqrt n+|\beta|n)^2. So for our error to be small, we want \alpha to be o(1/\sqrt{n}(\log n)^{3/2}) and \beta to be o(1/n).

    That is giving us a rough idea of the domain in which we can say confidently that the terms up to the quadratic ones give a good approximation to \hat f, and hence that \hat f^n is well approximated by a Gaussian.

    The remaining part

    Outside the domain, we have to do something different, and that something is fairly simple: we shall show that \hat f^n is very small. This is equivalent to showing that |\hat f| is bounded away from 1 by significantly more than 1/n. This we do by looking more directly at the formula for the Fourier transform:

    \hat f(\alpha,\beta)=\sum_{x,y}f(x,y)e^{2\pi i(\alpha x+\beta y)}.

    We would like this to have absolute value bounded away from 1 by significantly more than 1/n except when \alpha is quite a bit smaller than n^{-1/2}(\log n)^{-3/2} and \beta is quite a bit smaller than n^{-1}.

    Now in our case f is uniformly distributed on the n points (h_A(j),j-(n+1)/2). So we can write \hat f as

    \hat f(\alpha,\beta)=n^{-1}\sum_je^{2\pi i(\alpha h_A(j)+\beta(j-(n+1)/2))}.

    Here’s a possible way that we might try to bound that sum. Let m=n/2 and let us split up the sum into pairs of terms with j and j+m, for j\leq m. So each pair of terms will take the form

    e^{2\pi i(\alpha h_A(j)+\beta(j-(n+1)/2))}+e^{2\pi i(\alpha h_A(j+m)+\beta(j+m-(n+1)/2))}.

    The ratio of these two terms is

    e^{2\pi i(\alpha(h_A(j)-h_A(j+m))-\beta m)}.

    And if the ratio is e^{i\theta}, then the modulus of the sum of the two terms is at most 2(1-c\theta^2).

    Now let us suppose that as j varies, the differences h_A(j)-h_A(j+m) are mostly reasonably well distributed in an interval between -\sqrt n and \sqrt n, as seems very likely to be the case. Then the ratios above vary in a range from about -\alpha\sqrt n to \alpha\sqrt n. But that should imply that the entire sum, when divided by n, has modulus at most 1-c\alpha^2n. (This analysis obviously isn’t correct when \alpha is bigger than n^{-1/2}, since the modulus can’t be negative, but once we’re in that regime, then it really is easy to establish the bounds we want.)

    If \alpha is, say n^{-2/3}, then this gives us 1-cn^{-1/3}, and raising that to the power n gives us e^{-cn^{2/3}}, which is tiny.

    As a quick sanity check, note that for (1-c\alpha^2n)^n not to be tiny we need \alpha to be not much more than n^{-1}. This reflects the fact that a random walk of n steps of typical size about \sqrt n will tend to be at a distance comparable to n from the origin, and when you take the Fourier transform, you take the reciprocals of the distance scales.

    If \alpha is quite a bit smaller than n^{-1/2} and \beta is not too much smaller than n^{-1}, then the numbers \alpha h_A(j) are all small but the numbers \beta(j-(n+1)/2) vary quite a bit, so a similar argument can be used to show that in this case too the Fourier transform is not close enough to 1 for its nth power to be large. I won’t give details here.

    What is left to do?

    If the calculations above are not too wide of the mark, then the main thing that needs to be done is to show that for a typical die A the numbers h_A(j) are reasonably uniform in a range of width around \sqrt n, and more importantly that the numbers h_A(j)-h_A(j+m) are not too constant: basically I’d like them to be pretty uniform too.

    It’s possible that we might want to try a slightly different approach, which is to take the uniform distribution on the set of points (h_A(j),j-(n+1)/2), convolve it once with itself, and argue that the resulting probability distribution is reasonably uniform in a rectangle of width around n and height around \sqrt n. By that I mean that a significant proportion of the points are hit around \sqrt n times each (because there are n^2 sums and they lie in a rectangle of area n^{3/2}). But one way or another, I feel pretty confident that we will be able to bound this Fourier transform and get the local central limit theorem we need.

    http://gowers.wordpress.com/?p=6289
    Extensions
    Intransitive dice IV: first problem more or less solved?
    polymath13
    I hope, but am not yet sure, that this post is a counterexample to Betteridge’s law of headlines. To back up that hope, let me sketch an argument that has arisen from the discussion so far, which appears to get us close to showing that if and are three -sided dice chosen independently at random […]
    Show full content

    I hope, but am not yet sure, that this post is a counterexample to Betteridge’s law of headlines. To back up that hope, let me sketch an argument that has arisen from the discussion so far, which appears to get us close to showing that if A,B and C are three n-sided dice chosen independently at random from the sequences model (I will recap some of these definitions in a moment), then the probability that A beats C given that A beats B and B beats C is 1/2+o(1). Loosely speaking, the information that A beats B and B beats C tells you nothing about how likely A is to beat C. Having given the argument, I will try to isolate a statement that looks plausible, not impossible to prove, and sufficient to finish off the argument.

    Basic definitions

    An nsided die in the sequence model is a sequence (a_1,\dots,a_n) of elements of [n]=\{1,2,\dots,n\} such that \sum_ia_i=n(n+1)/2, or equivalently such that the average value of a_i is (n+1)/2, which is of course the average value of a random element of [n]. A random n-sided die in this model is simply an n-sided die chosen uniformly at random from the set of all such dice.

    Given n-sided dice A=(a_1,\dots,a_n) and B=(b_1,\dots,b_n), we say that A beats B if

    |\{(i,j):a_i>b_j\}|>|\{(i,j):a_i<b_j\}|.

    If the two sets above have equal size, then we say that A ties with B.

    A first reduction

    When looking at this problem, it is natural to think about the following directed graph: the vertex set is the set of all n-sided dice and we put an arrow from A to B if A beats B.

    We believe (and even believe we can prove) that ties are rare. Assuming that to be the case, then the conjecture above is equivalent to the statement that if A,B,C are three vertices chosen independently at random in this graph, then the probability that ABC is a directed cycle is what you expect for a random tournament, namely 1/8.

    One can also make a more general conjecture, namely that the entire (almost) tournament is quasirandom in a sense defined by Chung and Graham, which turns out to be equivalent to the statement that for almost all pairs (A,B) of dice, the four possible pairs of truth values for the pair of statements

    (A beats C, B beats C)

    each occur with probability approximately 1/4. If this is true, then given k random dice A_1,\dots,A_k, all the 2^{\binom k2} possibilities for which A_i beat which A_j have probability approximately 2^{-\binom k2}. This would imply, for example, that if A_1,\dots,A_k are independent random n-sided dice, then the probability that A_1 beats A_k given that A_i beats A_j for all other pairs (i,j) with i<j is still 1/2+o(1).

    Several of us have done computer experiments to test these conjectures, and it looks as though the first one is true and the second one false. A further reason to be suspicious of the stronger conjecture is that a natural approach to prove it appears to be morally equivalent to a relationship between the correlations of certain random variables that doesn’t seem to have any heuristic justification or to fit with experimental evidence. So although we don’t have a disproof of the stronger conjecture (I think it would be very interesting to find one), it doesn’t seem like a good idea to spend a lot of effort trying to prove it, unless we can somehow explain away the evidence that appears to be stacking up against it.

    The first conjecture turns out to be equivalent to a statement that doesn’t mention transitivity. The very quick proof I’ll give here was supplied by Luke Pebody. Suppose we have a tournament (that is, a complete graph with each edge directed in one of the two possible directions) and write d_+(x) for the out-degree of a vertex x (that is, the number of y such that there is an arrow from x to y) and d_-(x) for the in-degree. Then let us count the number of ordered triples (x,y,z) such that x\to y\to z. Any directed triangle xyz in the tournament will give rise to three such triples, namely (x,y,z), (y,z,x) and (z,x,y). And any other triangle will give rise to just one: for example, if x\to y, x\to z and y\to z we get just the triple (x,y,z). So the number of ordered triples (x,y,z) such that x\to y and y\to z is \binom n3 plus twice the number of directed triangles. Note that \binom n3 is approximately n^3/6.

    But the number of these ordered triples is also \sum_yd_+(y)d_-(y). If almost all in-degrees and almost all out-degrees are roughly n/2, then this is approximately n^3/4, which means that the number of directed triangles is approximately (n^3/4-n^3/6)/2\approx\binom n3/4. That is, in this case, the probability that three dice form an intransitive triple is approximately 1/4, as we are hoping from the conjecture. If on the other hand several in-degrees fail to be roughly n/2, then \sum_yd_+(y)d_-(y) is substantially lower than n^3/4 and we get a noticeably smaller proportion of intransitive triples.

    Thus, the weaker conjecture is equivalent to the statement that almost every die beats approximately half the other dice.

    Why should a “typical” die beat approximately half the other dice?

    The answer to this is fairly simple, heuristically at least. Let A be an arbitrary die. For j=1,2,\dots,n define g_A(j) to be the number of i with a_i\leq j and define h_A(j) to be g_A(j)-j. Then

    \sum_jg_A(j)=\sum_j\sum_i\mathbf 1_{[a_i\leq j]}=\sum_i(n-a_i+1)

    =n(n+1)-\sum_ia_i=n(n+1)/2,

    from which it follows that \sum_jh_A(j)=0.

    We also have that if B is another die, then

    \sum_jh_A(b_j)=\sum_j(\sum_i\mathbf 1_{a_i\leq b_j}-b_j)=\sum_{i,j}\mathbf 1_{[a_i\leq b_j]}-n(n+1)/2.

    If we make the simplifying assumption that a_i=b_j sufficiently infrequently to make no real difference to what is going on (which is not problematic, as a slightly more complicated but still fairly simple function can be used instead of h_A to avoid this problem), then we find that to a reasonable approximation B beats A if and only if \sum_jh_A(b_j) is positive.

    So what we would like to prove is that if b_1,\dots,b_n are chosen independently at random from [n], then

    \mathbb P\Bigl[\sum_jh_A(b_j)>0\Bigm|\sum_jb_j=n(n+1)/2\Bigr]\approx 1/2.

    We are therefore led to consider the random variable

    (X,Y)=(\sum_jh_A(b_j),\sum_j(b_j-(n+1)/2))

    where now B=(b_1,\dots,b_n) is chosen uniformly at random from [n]^n without any condition on the sum. To write this in a more transparent way, let (U,V) be the random variable (h_A(x), x-(n+1)/2), where x is chosen uniformly at random from [n]. Then (X,Y) is a sum of n independent copies of (U,V). What we are interested in is the distribution we obtain when we condition the random variable (X,Y) on Y=0.

    This should mean that we are in an excellent position, since under appropriate conditions, a lot is known about sums of independent random variables, and it looks very much as though those conditions are satisfied by (U,V), at least when A is “typical”. Indeed, what we would expect, by the central limit theorem, is that (X,Y) will approximate a bivariate normal distribution with mean 0 (since both U and V have mean zero). But a bivariate normal distribution is centrally symmetric, so we expect the distribution of [X|Y=0] to be approximately centrally symmetric, which would imply what we wanted above, since that is equivalent to the statement that \mathbb P[X>0|Y=0]\approx 1/2.

    Why are we not immediately done?

    How can we make the above argument rigorous? The central limit theorem on its own is not enough, for two reasons. The first is that it does not give us information about the speed of convergence to a normal distribution, whereas we need a sum of n copies of (U,V) to be close to normal. The second is that the notion of “close to normal” is not precise enough for our purposes: it will allow us to approximate the probability of an event such as [X\geq x\wedge Y\geq y] but not of a “probability zero” event such as [X>0\wedge Y=0].

    The first of these difficulties is not too worrying, since plenty of work has been done on the speed of convergence in the central limit theorem. In particular, there is a famous theorem of Berry and Esseen that is often used when this kind of information is needed.

    However, the Berry-Esseen theorem still suffers from the second drawback. To get round that one needs to turn to more precise results still, known as local central limit theorems, often abbreviated to LCLTs. With a local central limit theorem, one can even talk about the probability that (X,Y) takes a specific value after a specific number of steps. Roughly speaking, it says (in its 2-dimensional version) that if (U,V) is a random variable of mean zero taking values in \mathbb Z^2 and if (U,V) satisfies suitable moment conditions and is not supported in a proper sublattice of \mathbb Z^2, then writing (X,Y) for a sum of n copies of (U,V), we have that the probability that (X,Y) takes a particular value differs from the “expected” probability (given by a suitable Gaussian formula) by O(n^{-2}). (I’m not 100% sure I’ve got that right: the theorem in question is Theorem 2.1.1 from this book.)

    That looks very close to what we want, but it still falls short. The problem is that the implied constant depends on the random variable (U,V). A simple proof of this is that if (U,V) is not supported in a sublattice but very nearly is — for example, if the probability that it takes a value outside the sublattice is 2^{-2^n} — then one will have to add together an extremely large number of copies of (U,V) before the sum ceases to be concentrated in the sublattice.

    So the situation we appear to be in is the following. We have more precise information about the random variable (U,V) than is assumed in the LCLT in the reference above, and we want to use that to obtain an explicit constant in the theorem.

    It could be that out there in the literature is exactly the result we need, which would be nice, but it also seems possible that we will have to prove an appropriate version of the LCLT for ourselves. I’d prefer the first, but the second wouldn’t be too disappointing, as the problem is quite appealing and even has something of an additive-combinatorial flavour (since it is about describing an iterated convolution of a subset of \mathbb Z^2 under appropriate assumptions).

    What properties can we establish about the random variable (U,V)?

    I said above, with no justification, that we have more precise information about the random variable (U,V). Let me now try to give the justification.

    First of all, we know everything we could possibly want to know about V: it is the uniform distribution on [n] - (n+1)/2. (In particular, if n is odd, then it is the uniform distribution on the set of integers in [-(n-1)/2, (n-1)/2].)

    How about the distribution of U? That question is equivalent to asking about the values taken by h_A(j), and their multiplicities. There is quite a lot one can say about those. For example, I claim that with high probability (if A is a random n-sided die) h_A(j) is never bigger than C\sqrt{n\log n}. That is because if we choose a fully random sequence (a_1,\dots,a_n)\in[n]^n, then the expected number of i such that a_i\leq j is j, and the probability that this number differs from j by more than t\sqrt n is \exp(-ct^2), by standard probabilistic estimates, so if we set t=C\sqrt{\log n}, then this is at most n^{-cC}, which we can make a lot smaller than n^{-1} by choosing C to be, say, 10c^{-1}. (I think c can be taken to be 1/8 if you want me to be more explicit.) Since the probability that \sum_ia_i=n(n+1)/2 is proportional to 1/n, it follows that this conclusion continues to hold even after we condition on that event.

    Another simple observation is that the values taken by U are not contained in a sublattice (assuming, that is, that h_A is ever non-zero). That is simply because h_A(j+1)\geq h_A(j)-1 and h_A averages zero.

    A third simple observation is that with probability 1-o(1) h_A will take a value of at least \sqrt n/\log n at least somewhere. I’ll sketch a proof of this. Let k be around \log n and let m_1<\dots<m_k be evenly spaced in [n], staying away from the end points 1 and n. Let A be a purely random sequence in [n]^n. Then the standard deviation of h_A(m_1) is around \sqrt{n/\log n}, so the probability that it is less than \sqrt n/\log n is around (\log n)^{-1/2}. The same is true of the conditional probability that h_A(m_i) is less than \sqrt n/\log n conditioned on the value of h_A(m_{i-1}) (the worst case being when this value is 0). So the probability that this happens for every i is at most (\log n)^{-\log n/2}. This is much smaller than n^{-1}, so the conclusion remains valid when we condition on the sum of the a_i being n(n+1)/2. So the claim follows. Note that because of the previous simple observation, it follows that h_A must be at least c\sqrt{n/\log n} in magnitude at least c\sqrt{n/\log n} times, so up to log factors we get that \sum_jh_A(j)^2 is at least n^{3/2}. With a bit more effort, it should be possible to push this up to something more like n^2, since one would expect that h_A(j) would have rough order of magnitude \sqrt n for a positive fraction of the j. Maybe this would be a good subproblem to think about, and ideally not too difficult.

    How about the joint distribution (U,V)? It seems highly likely that for typical A this will not be concentrated in a lattice, and that elementary arguments such as the above can be used to prove this. But let me indicate the kind of situation that we would have to prove is not typical. Suppose that n=15 and A=(1,1,1,1,1,8,8,8,8,8,15,15,15,15,15). Then as j runs from 1 to 15 the values taken by g_A are (5,5,5,5,5,5,5,10,10,10,10,10,10,10,15) and the values taken by h_A are (4,3,2,1,0,-1,-2,2,1,0,-1,-2,-3,-4,0). For this example, all the points (h_A(x),x) live in the lattice of points (u,v) such that u+v is a multiple of 5.

    This wouldn’t necessarily be a disaster for us actually, since the LCLT can be restricted to a sublattice and if after conditioning on Y=0 we happen to have that X is always a multiple of 5, that isn’t a problem if we still have the central symmetry. But it would probably be nicer to prove that it is an atypical occurrence, so that we don’t have to worry about (U,V) living inside a sublattice (or even being concentrated in one).

    My guess is that if we were to pursue these kinds of thoughts, we would end up being able to prove a statement that would say something like that (U,V) takes a pretty representative sample of values with V being between -(n-1)/2 and (n-1)/2 and U being in a range of width around \sqrt n. I would expect, for example, that if we add three or four independent copies of (U,V), then we will have a distribution that is similar in character to the uniform distribution on a rectangle of width of order of magnitude n and height of order of magnitude \sqrt n. And if that’s true, then adding n of them should give us something very close to normal (in an appropriate discrete sense of the word “normal”).

    What next?

    There are two obvious tasks here. One is to try to prove as much as we can about the random variable (U,V). The other is to try to prove a suitable LCLT that is strong enough to give us that the probability that X>0 given that Y=0 is approximately 1/2, under suitable assumptions about (U,V). And then we have to hope that what we achieve for the first is sufficient for the second.

    It’s possible that the second task can be achieved by simply going through one of the existing proofs of the LCLT and being more careful about the details. But if that’s the case, then we should spend some time trying to find out whether anyone has done it already, since there wouldn’t be much point in duplicating that work. I hope I’ve set out what we want clearly enough for any probabilist who might stumble upon this blog post to be able to point us in the right direction if indeed the result we want is out there somewhere.

    http://gowers.wordpress.com/?p=6269
    Extensions
    Intransitive dice III
    polymath13
    I now feel more optimistic about the prospects for this project. I don’t know whether we’ll solve the problem, but I think there’s a chance. But it seems that there is after all enough appetite to make it an “official” Polymath project. Perhaps we could also have an understanding that the pace of the project […]
    Show full content

    I now feel more optimistic about the prospects for this project. I don’t know whether we’ll solve the problem, but I think there’s a chance. But it seems that there is after all enough appetite to make it an “official” Polymath project. Perhaps we could also have an understanding that the pace of the project will be a little slower than it has been for most other projects. I myself have various other mathematical projects on the boil, so can’t spend too much time on this one, but quite like the idea of giving it an occasional go when the mood takes me, and trying to make slow but steady progress. So I’ve created a polymath13 category, into which this post now fits. I’ve also retrospectively changed the category for the previous two posts. I don’t think we’ve got to the stage where a wiki will be particularly useful, but I don’t rule that out at some point in the future.

    In this post I want to expand on part of the previous one, to try to understand better what would need to be true for the quasirandomness assertion to be true. I’ll repeat a few simple definitions and simple facts needed to make the post more self-contained.

    By an nsided die I mean a sequence in [n]^n (where [n] is shorthand for \{1,2,\dots,n\}) that adds up to n(n+1)/2. Given an n-sided die A=(a_1,\dots,a_n) and j\in[n], I define g_A(j) to be the number of i such that a_i\leq j and h_A(j) to be g_A(j)-j.

    We can write h_A(j) as \sum_i\mathbf 1_{[a_i\leq j]}-j. Therefore, if C=(c_1,\dots,c_n) is another die, or even just an arbitrary sequence in [n]^n, we have that
    \sum_jh_A(c_j)=\sum_j(\sum_i\mathbf 1_{[a_i\leq c_j]}-c_j)=\sum_{i,j}\mathbf 1_{[a_i\leq c_j]} - \sum_jc_j.
    If \sum_jc_j=n(n+1)/2 and no a_i is equal to any c_j, then the sign of this sum therefore tells us whether A beats C. For most A, we don’t expect many ties, so the sign of the sum is a reasonable, but not perfect, proxy for which of the two dice wins. (With a slightly more complicated function we can avoid the problem of ties: I shall stick with the simpler one for ease of exposition, but would expect that if proofs could be got to work, then we would switch to the more complicated functions.)

    This motivates the following question. Let A and B be two random dice. Is it the case that with high probability the remaining dice C are split into four sets of roughly equal size according to the signs of h_A(C) and h_B(C)? I expect the answer to this question to be the same as the answer to the original transitivity question, but I haven’t checked as carefully as I should that my cavalier approach to ties isn’t problematic.

    I propose the following way of tackling this question. We fix A and B and then choose a purely random sequence C=(c_1,\dots,c_n) (that is, with no constraint on the sum) and look at the 3D random variable
    (\sum_jh_A(c_j),\sum_jh_B(c_j),\sum_j(c_j-(n+1)/2)).
    Each coordinate separately is a sum of independent random variables with mean zero, so provided not too many of the h_A or h_B are zero, which for random A and B is a reasonable assumption, we should get something that approximates a trivariate normal distribution.

    Therefore, we should expect that when we condition on \sum_j(c_j-(n+1)/2) being zero, we will get something that approximates a bivariate normal distribution. Although that may not be completely straightforward to prove rigorously, tools such as the Berry-Esseen theorem ought to be helpful, and I’d be surprised if this was impossibly hard. But for now I’m aiming at a heuristic argument, so I want simply to assume it.

    What we want is for the signs of the first two coordinates to be approximately independent, which I think is equivalent to saying (assuming normality) that the first two coordinates themselves are approximately independent.

    However, what makes the question interesting is that the first two coordinates are definitely not independent without the conditioning: the random variables \sum_jh_A(c_j) and \sum_jh_B(c_j) are typically quite strongly correlated. (There are good reasons to expect this to be the case, and I’ve tested it computationally too.) Also, we expect correlations between these variables and \sum_j(c_j-(n+1)/2). So what we are asking for is that all these correlations should disappear when we condition appropriately. More geometrically, there is a certain ellipsoid, and we want its intersection with a certain plane to be a circle.

    The main aim of this post is to make the last paragraph more precise. That is, I want to take three standard normal random variables X, Y and Z that are not independent, and understand precisely the circumstances that guarantee that X and Y become independent when we condition on Z.

    The joint distribution of (X,Y,Z) is determined by the matrix of correlations. Let this matrix be split up as \begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\\ \end{pmatrix}, where \Sigma_{11} is the 2\times 2 covariance matrix of (X,Y), \Sigma_{12} is a 2\times 1 matrix, \Sigma_{21} is a 1\times 2 matrix and \Sigma_{22} is the matrix (1). A general result about conditioning joint normal distributions on a subset of the variables tells us, if I understand the result correctly, that the covariance matrix of (X,Y) when we condition on the value of Z is \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. (I got this from Wikipedia. It seems to be quite tricky to prove, so I hope it really can be used as a black box.) So in our case if we have a covariance matrix \begin{pmatrix}1&\lambda&\mu\\ \lambda&1&\nu\\ \mu&\nu&1\\ \end{pmatrix} then the covariance matrix of (X,Y) conditioned on Z should be \begin{pmatrix}1-\mu^2&\lambda-\mu\nu\\ \lambda-\mu\nu&1-\nu^2\\ \end{pmatrix}.

    That looks dimensionally odd because I normalized the random variables to have variance 1. If instead I had started with the more general covariance matrix \begin{pmatrix}a&\lambda&\mu\\ \lambda&b&\nu\\ \mu&\nu&c\\ \end{pmatrix} I would have ended up with \begin{pmatrix}a-c^{-1}\mu^2&\lambda-c^{-1}\mu\nu\\ \lambda-c^{-1}\mu\nu&b-c^{-1}\nu^2\\ \end{pmatrix}.

    So after the conditioning, if we want X and Y to become independent, we appear to want \lambda c to equal \mu\nu. That is, we want
    \langle X,Z\rangle\langle Y,Z\rangle=\langle X,Y\rangle\langle Z,Z\rangle
    where I am using angle brackets for covariances.

    If we divide each variable by its standard deviation, that gives us that the correlation between X and Y should be the product of the correlation between X and Z and the correlation between Y and Z.

    I wrote some code to test this, and it seemed not to be the case, anything like, but I am not confident that I didn’t make careless mistakes in the code. (However, my correlations were reasonable numbers in the range [-1,1], so any mistakes there might have been didn’t jump out at me. I might just rewrite the code from scratch without looking at the old version.)

    One final remark I’d like to make is that if you feel there is something familiar about the expression \langle x,z\rangle\langle y,z\rangle-\langle x,y\rangle\langle z,z\rangle, then you are not completely wrong. The formula for the vector triple product is

    x\wedge(y\wedge z)=\langle x,z\rangle y - \langle x,y\rangle z.

    Therefore, the expression can be condensed to \langle x\wedge(y\wedge z),z\rangle. Now this is the scalar triple product of the three vectors x, y\wedge z, and z. For this to be zero, we need x to lie in the plane generated by y\wedge z and z. Note that y\wedge z is orthogonal to both y and z. So if P is the orthogonal projection to the subspace generated by z, we want x-Px to be orthogonal to y. Actually, that can be read out of the original formula too, since it is \langle\langle x,z\rangle z - \langle z,z\rangle x,y\rangle. A nicer way of thinking of it (because more symmetrical) is that we want the orthogonal projections of x and y to the subspace orthogonal to z to be orthogonal. To check that, assuming (WLOG) that \|z\|_2=1,
    \langle x-\langle x,z\rangle z,y-\langle y,z\rangle z\rangle=\langle x,y\rangle-\langle x,z\rangle\langle y,z\rangle.

    So what I’d like to see done (but I’m certainly not saying it’s the only thing worth doing) is the following.

    1. Test experimentally whether for a random pair (A,B) of n-sided dice we find that the correlations of the random variables X=\sum_jh_A(j), Y=\sum_jh_B(j) and Z=\sum_j(c_j-(n+1)/2) really do appear to satisfy the relationship

    corr(X,Z).corr(Y,Z)= corr(X,Y).

    Here the c_j are chosen randomly without any conditioning on their sum. My experiment seemed to indicate not, but I’m hoping I made a mistake.

    2. If they do satisfy that relationship, then we can start to think about why.

    3. If they do not satisfy it, then we can start to think about why not. In particular, which of the heuristic assumptions used to suggest that they should satisfy that relationship is wrong — or is it my understanding of multivariate normals that is faulty?

    If we manage to prove that they typically do satisfy that relationship, at least approximately, then we can think about whether various distributions become sufficiently normal sufficiently quickly for that to imply that intransitivity occurs with probability 1/4.

    http://gowers.wordpress.com/?p=6251
    Extensions
    Intransitive dice II
    polymath13
    I’m not getting the feeling that this intransitive-dice problem is taking off as a Polymath project. However, I myself like the problem enough to want to think about it some more. So here’s a post with some observations and with a few suggested subproblems that shouldn’t be hard to solve and that should shed light […]
    Show full content

    I’m not getting the feeling that this intransitive-dice problem is taking off as a Polymath project. However, I myself like the problem enough to want to think about it some more. So here’s a post with some observations and with a few suggested subproblems that shouldn’t be hard to solve and that should shed light on the main problem. If the rate of comments by people other than me doesn’t pick up, then I think I’ll simply conclude that there wasn’t sufficient interest to run the project. However, if I do that, I have a back-up plan, which is to switch to a more traditional collaboration — that is, done privately with a small number of people. The one non-traditional aspect of it will be that the people who join the collaboration will select themselves by emailing me and asking to be part of it. And if the problem gets solved, it will be a normal multi-author paper. (There’s potentially a small problem if someone asks to join in with the collaboration and then contributes very little to it, but we can try to work out some sort of “deal” in advance.)

    But I haven’t got to that point yet: let me see whether a second public post generates any more reaction.

    I’ll start by collecting a few thoughts that have already been made in comments. And I’ll start that with some definitions. First of all, I’m going to change the definition of a die. This is because it probably makes sense to try to prove rigorous results for the simplest model for which they are true, and random multisets are a little bit frightening. But I am told that experiments suggest that the conjectured phenomenon occurs for the following model as well. We define an n-sided die to be a sequence A=(a_1,\dots,a_n) of integers between 1 and n such that \sum_ia_i=n(n+1)/2. A random n-sided die is just one of those chosen uniformly from the set of all of them. We say that A beats B if
    \sum_{i,j}\mathbf 1_{[a_i>b_j]}>\sum_{i,j}\mathbf 1_{[a_i<b_j]}.
    That is, A beats B if the probability, when you roll the two dice, that A shows a higher number than B is greater than the probability that B shows a higher number than A. If the two probabilities are equal then we say that A ties with B.

    The main two conjectures are that the probability that two dice tie with each other tends to zero as n tends to infinity and that the “beats” relation is pretty well random. This has a precise meaning, but one manifestation of this randomness is that if you choose three dice A, B and C uniformly at random and are given that A beats B and B beats C, then the probability that A beats C is, for large n, approximately 1/2. In other words, transitivity doesn’t happen any more often than it does for a random tournament. (Recall that a tournament is a complete graph in which every edge is directed.)

    Now let me define a function that helps one think about dice. Given a die A, define a function f_A on the set [n]=\{1,2,\dots,n\} by
    f_A(j)=\sum_i(\mathbf 1_{[a_i>j]}-\mathbf 1_{[a_i<j]})=|\{i:a_i>j\}|-|\{i:a_i<j\}|.
    Then it follows immediately from the definitions that A beats B if \sum_jf_A(b_j)>0, which is equivalent to the statement that \sum_jf_B(a_j)<0.

    If the “beats” tournament is quasirandom, then we would expect that for almost every pair of dice A,B the remaining dice are split into four parts of roughly equal sizes, according to whether they beat A and whether they beat B. So for a typical pair of dice A,B we would like to show that \sum_jf_A(c_j)>0 for roughly half of all dice C, and \sum_jf_B(c_j)>0 for roughly half of all dice C, and that these two events have almost no correlation.

    It is critical here that the sums should be fixed. Otherwise, if we are told that A beats B, the most likely explanation is that the sum of A is a bit bigger than the sum of B, and then A is significantly more likely to beat C than B is.

    Note that for every die A we have
    \sum_jf_A(j)=\sum_{i,j}(\mathbf 1_{[a_i>j]}-\mathbf 1_{[a_i<j]})
    =\sum_i((a_i-1)-(n-a_i))=2\sum_ia_i-n(n+1)=0.
    That is, every die ties with the die (1,2,\dots,n).

    Now let me modify the functions f_A to make them a bit easier to think about, though not quite as directly related to the “beats” relation (though everything can be suitably translated). Define g_A(j) to be |\{i:a_i\leq j\}| and h_A(j) to be g_A(j)-j. Note that f_A(j)=(n-g_A(j))-g_A(j-1) which would normally be approximately equal to n-2g_A(j).

    We are therefore interested in sums such as \sum_jg_A(c_j). I would therefore like to get a picture of what a typical sequence (g_A(1),\dots,g_A(n)) looks like. I’m pretty sure that g_A(j) has mean j. I also think it is distributed approximately normally around j. But I would also like to know about how g_A(j_1) and g_A(j_2) correlate, since this will help us get some idea of the variance of \sum_jg_A(c_j), which, if everything in sight is roughly normal, will pin down the distribution. I’d also like to know about the covariance of \sum_jg_A(c_j) and \sum_jg_B(c_j), or similar quantities anyway, but I don’t want to walk before I can fly.

    Anyhow, I had the good fortune to see Persi Diaconis a couple of days ago, and he assured me that the kind of thing I wanted to understand had been studied thoroughly by probabilists and comes under the name “constrained limit theorems”. I’ve subsequently Googled that phrase and found some fairly old papers written in the typical uncompromising style and level of generality of their day, which leaves me thinking that it may be simpler to work a few things out from scratch. The main purpose of this post is to set out some exercises that have that as their goal.

    What is the average of g_A(j)?

    Suppose, then, that we have a random n-sided die A. Let’s begin by asking for a proper proof that the mean of g_A(j) is j. It clearly is if we choose a purely random n-tuple of elements of [n], but what happens if we constrain the average to be (n+1)/2?

    I don’t see an easy proof. In fact, I’m not sure it’s true, and here’s why. The average will always be j if and only if the probability that a_1\leq j is always equal to j/n, and that is true if and only if a_1 is uniformly distributed. (The distributions of the a_i are of course identical, but — equally of course — not independent.) But do we expect a_1 to be uniformly distributed? No we don’t: if a_1=(n+1)/2 that will surely make it easier for the global average to be (n+1)/2 than if a_1=n.

    However, I would be surprised if it were not at least approximately true. Here is how I would suggest proving it. (I stress that I am not claiming that this is an unknown result, or something that would detain a professional probabilist for more than two minutes — that is why I used the word “exercise” above. But I hope these questions will be useful exercises.)

    The basic problem we want to solve is this: if a_1,\dots,a_n are chosen independently and uniformly from [n], then what is the conditional probability that a_1=j given that the average of the a_i is exactly (n+1)/2?

    It’s not the aim of this post to give solutions, but I will at least say why I think that the problems aren’t too hard. In this case, we can use Bayes’s theorem. Using well-known estimates for sums of independent random variables, we can give good approximations to the probability that the sum is n(n+1)/2 and of the probability of that given that a_1=j (which is just the probability that the sum of the remaining n-1 a_is is n(n+1)/2-j\ ). We also know that the probability that a_1=j is 1/n. So we have all the information we need. I haven’t done the calculation, but my guess is that the tendency for a_1 to be closer to the middle than to the extremes is not very pronounced.

    In fact, here’s a rough argument for that. If we choose a_i uniformly from [n], then the variance is about n^2/12. So the variance of the sum of the a_i (in the fully independent case) is about n^3/12, so the standard deviation is proportional to n^{3/2}. But if that’s the case, then the probability that the sum equals n(n+1)/2+t is roughly constant for t=O(n).

    I think it should be possible to use similar reasoning to prove that if m=o(\sqrt n), then a_1,\dots,a_m are approximately independent. (Of course, this would apply to any m of the a_i, if correct.)

    How is g_A(j) distributed?

    What is the probability that j+s of the a_i are at most j? Again, it seems to me that Bayes’s theorem and facts about sums of independent random variables are enough for this. We want the probability of the above event given that \sum_ia_i=n(n+1)/2. By Bayes’s theorem, we can work this out if we know the probability that \sum_ia_i=n(n+1)/2 given that g_A(j)=j+s, together with the probability that \sum_ia_i=n(n+1)/2 and the probability that g_A(j)=j+s, in both cases when A is chosen fully independently. The last two calculations are simple. The first one isn’t 100% simple, but it doesn’t look too bad. We have a sum of j+s random variables that are uniform on [j] and n-j-s that are uniform on \{j+1,\dots,n\} and we want to know how likely it is that they add up to n(n+1)/2. We could do this by conditioning on the possible values of the two sums, which then leaves us with sums of independent variables, and adding up all the results. It looks to me as though that calculation shouldn’t be too unpleasant. What I would recommend is to do the calculation on the assumption that the distributions are normal (in a suitable discrete sense) with whatever mean and variance they have to have, since that will yield an answer that is almost certainly correct. A rigorous proof can come later, and shouldn’t be too much harder.

    The answer I expect and hope for is that g_A(j) is approximately normally distributed with mean j and a variance that would come out of the calculations.

    What is the joint distribution of g_A(j_1) and g_A(j_2)?

    This can in principle be done by exactly the same technique, except that now things get one step nastier because we have to condition on the sum of the a_i that are at most j_1, the sum of the a_i that are between j_1+1 and j_2, and the sum of the rest. So we end up with a double sum of products of three probabilities at the end instead of a single sum of products of two probabilities. The reason I haven’t done this is that I am quite busy with other things and the calculation will need a strong stomach. I’d be very happy if someone else did it. But if not, I will attempt it at some point over the next … well, I don’t want to commit myself too strongly, but perhaps the next week or two. At this stage I’m just interested in the heuristic approach — assume that probabilities one knows are roughly normal are in fact given by an exact formula of the form Ae^{-\lambda (x-\mu)^2}.

    For some experimental evidence about this, see a comment by Ian on the previous post, which links to some nice visualizations. Ian, if you’re reading this, it would take you about another minute, I’d have thought, to choose a few random dice A and plot the graphs h_A. It would be interesting to see such plots to get an idea of what a typical one looks like: roughly how often does it change sign, for example?

    What is a heuristic argument for the “A beats B” tournament being quasirandom?

    I have much less to say here — in particular, I don’t have a satisfactory answer. But I haven’t spent serious time on it, and I think it should be possible to get one.

    One slight simplification is that we don’t have to think too hard about whether A beats B when we are thinking about the three dice A, B and C. As I commented above, the tournament will be quasirandom (I think I’m right in saying) if for almost every A and B the events “A beats C” and “B beats C” have probability roughly 1/2 each and are hardly correlated.

    A good starting point would be the first part. Is it true that almost every die beats approximately half the other dice? This question was also recommended by Bogdan Grechuk in a comment on the previous post. He suggested, as a preliminary question, the question of finding a good sufficient condition on a die for this to be the case.

    That I think is approachable too. Let’s fix some function g_A without worrying too much about whether it comes from a die (but I have no objection to assuming that it is non-decreasing and that g_A(n)=n, should that be helpful). Under what conditions can we be confident that the sum \sum_jg_A(c_j) is greater than n(n+1)/2 with probability roughly 1/2, where C=(c_1,\dots,c_n) is a random die?

    Assuming it’s correct that each c_j is roughly uniform, \sum_jg_A(c_j) is going to average \sum_jg_A(j), which if A is a die will be close to n(n+1)/2. But we need to know rather more than that in order to obtain the probability in question.

    But I think the Bayes approach may still work. We’d like to nail down the distribution of \sum_jg_A(c_j) given that \sum_jc_j=n(n+1)/2. So we can look at \mathbb P[\sum_jg_A(c_j)=n(n+1)/2+t|\sum_jc_j=n(n+1)/2], where now the c_i are chosen uniformly and independently. Calling that \mathbb P[X|Y], we find that it’s going to be fairly easy to estimate the probabilities of X and Y. However, it doesn’t seem to be notably easier to calculate \mathbb P[Y|X] than it is to calculate \mathbb P[X|Y]. But we have made at least one huge gain, which is that now the c_j are independent, so I’d be very surprised if people don’t know how to estimate this probability. Indeed, the probability we really want to know is \mathbb P[X\wedge Y]. From that all else should follow. And I think that what we’d like is a nice condition on g_A that would tell us that the two events are approximately independent.

    I’d better stop here, but I hope I will have persuaded at least some people that there’s some reasonably low-hanging fruit around, at least for the time being.

    http://gowers.wordpress.com/?p=6233
    Extensions