A Neighborhood of Infinity

Some type constructors are tensor products

sigfpe Apr 27, 2026 Updated Apr 27, 2026

Show full content

Introduction

I want to return to something I've mentioned a couple of times in the past - the fact that applying certain type constructors performs a tensor product.

First some admin stuff:

> {-# LANGUAGE DeriveFunctor #-}
> {-# LANGUAGE FlexibleInstances #-}
> {-# LANGUAGE MultiParamTypeClasses #-}
> {-# LANGUAGE UndecidableInstances #-}
> {-# LANGUAGE TypeApplications #-}
> {-# LANGUAGE KindSignatures #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE AllowAmbiguousTypes #-}

> import Data.Proxy
> import Data.Kind (Type)

> infixr 7 ⊗

Suppose you define a type like so:

> data Complex a = C a a
>     deriving (Eq, Show, Functor)

> instance Num a => Num (Complex a) where
>     fromInteger n = C (fromInteger n) 0

>     C a b + C c d = C (a + c) (b + d)
>     C a b - C c d = C (a - c) (b - d)

>     C a b * C c d = C (a * c - b * d) (a * d + b * c)

>     negate (C a b) = C (negate a) (negate b)

>     abs    = error "abs doesn't make sense here"
>     signum = error "signum makes no sense here"

It seems straightforward. You've defined complex numbers in a way that allows a choice of base type to represent the real numbers. For example you could use Complex Float or Complex Double as representations of $\mathbb{C}$.

In actual fact you've done quite a bit more! That code has another reading - it implements a tensor product both in the category of vector spaces, and, less trivially, in the category of algebras. So if A is a suitable algebraic structure then, if you allow me to mix code and mathematics notation,

\[ \mathtt{Complex\ A} = \mathbb{C}\otimes\mathtt{A} \]

I took this for granted when I mentioned it previously but I thought I'd look into it in a little bit more detail.

Tensor Products

I want to start from the definition of the tensor product given by its universal property, but to make that slightly less fearsome I'll use an English sketch of it.

Suppose you have a pair of vector spaces $X$ and $Y$ over some base field $k$. A bilinear function $X\times Y\rightarrow Z$ is a function that is linear in $X$ and linear in $Y$. Now suppose we know that at some point in the future we are going to need some bilinear function on $X\times Y$ but don't yet know what it is. Can we make a structure, $T$, that contains precisely the information we need so that we can compute any bilinear function we want - with the proviso that we compute these bilinear functions by applying a linear function to $T$? We don't want $T$ to be lacking anything we might need to compute a future bilinear product, but we also don't want it to contain any extraneous data.

For example, imagine working with $V$, the vector space of 3D vectors. Some examples of bilinear functions we might want are the dot product $V\cdot V\rightarrow\mathbb{R}$ and the cross product $V\times V\rightarrow V$. What should $T$ look like?

We can write the dot product as $(x, y, z)\cdot(x', y', z') = xx'+yy'+zz'$. Note how it's made of products of coordinates from $(x, y, z)$ and coordinates from $(x', y', z')$. Similarly $(x, y, z)\times(x', y', z')=(yz'-zy',\ldots)$. Again, it's a linear combination of products of coordinates, one from each vector. You can prove that any bilinear product will be some linear combination of such products.

By thinking about all possible bilinear products you I hope you can see that $T$ should be a 9-dimensional vector space and a suitable way to represent a pair of vectors $(x, y, z), (x', y', z')$ for future application of a bilinear function is as $(xx', xy', xz', yx', yy', yz', zx', zy', zz')$. Any bilinear product is a linear combination of these 9 quantities and so is given by some linear operation on $T$. It is commonplace to arrange the 9-dimensional vector as a $3\times 3$ matrix in which case the map from the pair is called the outer product. But it doesn't really matter as all 9-dimensional vector spaces over a given field are isomorphic.

In this case I chose to consider bilinear functions on $V\times V$, but you can reason similarly for any pair of vector spaces $X$ and $Y$. When working with finite-dimensional vector spaces, the structure we need will be $mn$-dimensional where $m$ is the dimension of $X$ and $n$ is the dimension of $Y$. The structure is called the tensor product and is written as $X\otimes Y$. The bilinear map from the original vectors into the tensor product is also called the tensor product and as written as a binary operator $x\otimes y$. And once you have the tensor product, every bilinear function on the original pair of spaces can be expressed uniquely as a linear function on the tensor product.

So, for example, the dot product can be written as

\[ x\cdot y = \phi(x\otimes y) \]

where $(x, y, z)\otimes(x', y', z')=(xx',xy',\ldots zz')$ and so the linear function is $\phi(x_0, x_1,\ldots,x_8) = x_0+x_4+x_8$.

Similarly

\[ x\times y = \psi(x\otimes y) \]

where $\psi(x_0, x_1, \ldots, x_8)=(x_5 - x_7, x_6 - x_2, x_1 - x_3)$

Algebras

It's a confusing use of terminology, but the term "algebra (over $k$)" is used specifically to mean a vector space $A$ (over $k$) equipped with a bilinear product $A\times A\rightarrow A$ which is compatible with the vector space structure. And in addition I'm assuming my algebras contain a multiplicative unit element. Other people may call this a "unital algebra". I'll use the word "unital" when I want to stress that there is a unit.

An example is the algebra of complex numbers $\mathbb{C}$ over $\mathbb{R}$. It's a 2-dimensional vector space over $\mathbb{R}$. We can, for example, scale complex numbers by elements of the base field. We also have properties like $(au)v = u(av)$ for $a\in\mathbb{R}$ and $u,v\in\mathbb{C}$. We can scale either argument of the complex product by a real and it makes no difference which we choose. See Wikipedia for all the properties an algebra must satisfy.

Vector spaces come with an addition operation and a zero but we're going to share the work out a little differently because our Num instance already has those. So our VectorSpace class is just going to have the scale operation:

> class VectorSpace k v where
>     scale :: k -> v -> v

> instance VectorSpace Double Double where
>     scale = (*)

You can think of the definition of Complex above as a container for the coordinates in a choice of basis. Because I use deriving Functor I can get the VectorSpace instance for all similar types for free:

> instance (Functor c, VectorSpace k a) => VectorSpace k (c a) where
>     scale k = fmap (scale k)

Because fmap composes through nested functors, scale descends recursively through arbitrarily nested structures like Complex (Complex Double).

And now we can concretely implement the bilinear tensor product operation in our choice of basis. It works by descending through the construction of $x$ until it reaches its individual coordinates and then uses each one to scale $y$. A special case of this is our 9-dimensional vector construction above: each batch of 3 coordinates is s scaling of one vector by a coordinate from the other.

> (⊗) :: (Functor c, VectorSpace k a) => c k -> a -> c a
> x ⊗ y = fmap (`scale` y) x

We're literally just recursively building a table of all products of coordinates of c k and coordinates of a.

Any bilinear function f :: U -> V -> W can now be implemented as f x y = phi (x ⊗ y) for a unique choice of phi.

Algebras too

But there's more, and this is the point of me writing this article. Algebras also have a tensor product defined on them. The underlying carrier space is the tensor product of algebras considered as vector spaces. The product structure is defined by $(x\otimes y)(x'\otimes y')=(xx')\otimes(yy')$ and linear combinations thereof. But what's neat here is that we don't have to write any more code to implement this, our Num instance is already doing the work.

We need to check that our definition of Complex satisfies this property. In fact, I want to prove it more generally for any type like Complex that has a multiplication that looks like

    C a b * C c d = C (a * c - b * d) (a * d + b * c)

ie. I'll assume we have a type F that is an instance of Num, with constructor F, and whose multiplication is constructed from a linear combination of terms of the form a * a'.

Something like:

    (F ... a ...) * (F ... a' ...) = F ... (... + a * a' + ...) ...

Note that I'm claiming

\[ \mathtt{F\ A} = \mathtt{F\ Double}\otimes\mathtt{A} \]

so I can suppose that a is in Double (or whatever we use to represent the reals).

Assuming * is such a product:

   (x ⊗ y) * (x' ⊗ y')
== fmap (`scale` y) x * fmap (`scale` y') x'
   -- definition of tensor
== fmap (`scale` y) (F ... a ...) * fmap (`scale` y') (F ... a' ...)
   -- stating our assumptions about the form of x and x'
== (F ... (scale a y) ...) * (F ... (scale a' y') ...)
   -- this is what derived fmap looks like
== F ... (... + scale a y * scale a' y' + ...) ...
   -- our assumption about the form that multiplication takes
== F ... (... + scale (a * a') (y * y') + ...) ...
   -- multiplication is bilinear all the way down
== fmap (`scale` (y * y')) (F ... (... + a * a' + ...))
   -- same fact about fmap used above
== fmap (`scale` (y * y')) (x * x')
   -- again our assumption about how multiplication is implemented
== (x * x') ⊗ (y * y')
   -- definition of tensor again

Anyway, my motivation here is that quite a while back someone (on Mastodon) I think pushed back on my claim that we have a tensor product so I thought I'd give some more detail.

I could say more. The tensor product of algebras has the nice property that you can embed the original algebras in it in a way that the two images commute with each other. In fact, if you can define the tensor product to be the initial algebra with this property. But this is too long already.

Also, I used Haskell above but it carries over straightforwardly to other languages, even C++.

> main :: IO ()
> main = do
>   print "Bye!"

tag:blogger.com,1999:blog-11295132.post-5406444218917761776

Extensions

A Simple Switch Makes Code Differentiable

sigfpe Apr 12, 2026 Updated Apr 12, 2026

Show full content

Introduction One game I like to play is decomposing code and algorithms as compositions of simpler pieces even when they already seem as simple as they could be.

One example is the observation that adjoint mode automatic differentiation isn't a separate algorithm to forward mode automatic differentiation but a composition of forward mode and transposition. I talked about this in an old paper of mine Two Tricks for the Price of One and it resurfaces more recently in Jax: You Only Linearize Once.

Another example is the REDUCE algorithm used to differentiate a class of stochastic process despite making hard decisions when sampling from a distribution. It turns out this algorithm is nothing but ordinary everyday importance sampling but generalized to probabilities lying in a non-standard algebraic structure. I describe it in a blog article here and you can find out more about extending beyond the non-negative reals in a paper by Abramsky and Brandenberger. You sort of don't have to lift a finger to implement REDUCE - it just happens "for free" when you switch algebraic structure.

Here's a teeny tiny example of another "for free" method that just appears when you switch types. The Good Old Vector Space Monad First let me write the same "vector space" monad that I've written many times before around here:

> module Main where

> import qualified Data.Map.Strict as Map
> import Control.Applicative
> import Control.Monad

> newtype V s a = V { unV :: [(a, s)] }
>   deriving (Show)

> instance Functor (V s) where
>   fmap f (V xs) = V [ (f a, s) | (a, s) <- xs ]

> instance Num s => Applicative (V s) where
>   pure x = V [(x, 1)]
>   mf <*> mx = do { f <- mf; x <- mx; return (f x) }

> instance Num s => Monad (V s) where
>   (V xs) >>= f = V [ (b, s * t) |
>                      (a, s) <- xs, (b, t) <- unV (f a) ]

> instance Num s => Alternative (V s) where
>   empty = V []
>   (<|>) = add

> instance Num s => MonadPlus (V s) where
>   mzero = empty
>   mplus = (<|>)

> normalize :: (Ord a, Num s, Eq s) => V s a -> V s a
> normalize (V xs) = V $ Map.toList $
>                 Map.filter (/= 0) $ Map.fromListWith (+) xs

> scale :: Num s => s -> V s a -> V s a
> scale c (V xs) = V [ (a, c * s) | (a, s) <- xs ]

> add :: Num s => V s a -> V s a -> V s a
> add (V xs) (V ys) = V (xs ++ ys)

Dictionaries One of the challenges in machine learning is to replace standard algorithms that make hard decisions with methods that are soft and squishy so they're amenable to being differentiated. An example might be building dictionaries of key-value pairs. So let's write a line of code to insert into a dictionary:

> insert' :: [(k, v)] -> k -> v -> [(k, v)]
> insert' dict k v = (k, v) : dict

It's not very clever, but it does function as a dictionary builder. Let's write it in a more general way using the fact that the list functor is applicative:

> insert :: Alternative m => m (k, v) -> k -> v -> m (k, v)
> insert dict k v = pure (k, v) <|> dict

We can go ahead and insert some entries:

> main :: IO ()
> main = do
>     let dict1 = empty :: [(Int, Int)]
>     let dict2 = insert dict1 0 1
>     let dict3 = insert dict2 1 0
>     print dict3

DeltaNet So how can we rewrite that to make it more amenable to differentiation? We don't need to! It's already general enough. We just switch to using V instead of []:

>     let dict4 = empty :: V Double (Int, Int)
>     let dict5 = insert dict4 0 1
>     let dict6 = insert dict5 1 0
>     print $ normalize dict6

What has happened is that we've replaced the dictionary update with

D' = D + k⊗v

Addition, tensor product, these things are more amenable to differentiation than appending to a list. When using the vector space monad (or applicative) the (,) operator plays a role more like tensor product. Note that we're also no longer limited to inserting basis elements, we can use any suitably typed vectors:

>     let dict7 = dict6 <|>
>          (0.5 `scale` pure (0, 0) <|> 0.5 `scale` pure (1, 1))
>     print $ normalize dict7

This is the key ingredient in the update rule in DeltaNet, described in Parallelizing Linear Transformers with the Delta Rule over Sequence Length at the start of section 2.2. Can we do this to other algorithms These dictionaries aren't very smart. Could we do a similar trick with a binary tree, or a hierarchical binary tree of lists? One way of approaching the binary comparisons turns this into a kind of mixture of experts model, but I'm not sure it's a good place to have MoE. Maybe there's another way?

tag:blogger.com,1999:blog-11295132.post-1695252496706452621

Extensions

"What does it take to be a hero?" revisited

sigfpe Mar 30, 2026 Updated Apr 9, 2026

Show full content

Biased posteriors the hard way

I was previously interested to see how die rolls in an RPG appear when conditioned on you having survived an unlikely situation. As might have been predicted, if the die rolls contribute to that survival in a largely additive way, for example by being damage scored against a large opponent, then the posterior distribution of the rolls looks exponentially tilted.

Biased posteriors an easier way

But the only virtue of the brute force Monte Carlo method I used was that it was easy to code. It's computationally wasteful. So I wrote a much more performant DSL in Python which I have put on github.

It uses numpy to achieve tolerable numerical performance, but in addition it uses two techniques beyond brute force to make it usable.

One challenge with a probabilistic language is to manage state.

First there's state "in the past": If you're computing probabilities that are sums of many large intermediate states, for example 100 die rolls, you run the risk of running foul of combinatorial explosion.

If you write a loop like:


for i in range(100):
  t += d(6)

you want to be sure that the += operation erases history (ie. previous values of t) so you aren't tracking all 6^100 individual histories. In this case it's easy but in other cases it might not be so obvious that you have unneeded state lying around. So the code has a simple (and incomplete) backward liveness pass to insert deletions of data that won't be used again. Whenever state is deleted, you can merge histories that are now indistinguishable.

And then there is state "in the future": sometimes you'd like to compute probabilities of data structures like lists but materializing a list results in state that can cause combinatorial explosion. So I support Python style generators allowing you to generate data lazily - for example permutations of cards. So we can bring into existence state just before we need it.

Auto batching

There's another important technique I used: this code uses brute force (though at this point maybe I should stop calling it brute force) so there are many states, each corresponding to a possible set of values for some numpy objects. And we often want to perform a numpy operation for each of these values. We don't need to loop. In many cases a parameterised family of numpy operations is in fact a single numpy operation. This kind of transformation is ubiquitous in GPU computing. So we can interpret the following &&D fight, summing over the combinatorialy large number of ways it could happen, in a few seconds:

@d9.dist
def f():
    # Brachiosaurus (Monster Manual 1e p. 24)
    hp1 = lazy_sum(36 @ d(8))
    # Tyrannosaurus Rex (Monster Manual 1e p.28)
    hp2 = lazy_sum(18 @ d(8))

    for i in range(14):
        print("round", i)
        if hp1 > 0 and d(20) > 1:
            hp2 = max(0, hp2 - lazy_sum((x for x in 5 @ d(4))))

        if hp2 > 0:
            # Two claws...
            if d(20) > 1:
                hp1 -= d(6)
            if d(20) > 1:
                hp1 -= d(6)
            # ...and a bite
            if d(20) > 1:
                hp1 -= lazy_sum((x for x in 5 @ d(8)))
            hp1 = max(hp1, 0)

    win1 = hp2 == 0
    win2 = hp1 == 0

    return win1, win2

Besides its own test suite I also used a large number of questions on the RPG Stack Exchange to build a library of examples for testing.

Let's work in a general semiring

Of mathematical interest: most of the probability computations take place in a semiring. So most of the numerical computing is simply addition and multiplication. When that's all you're doing, there is the well known technique for working with large integers where you work modulo p[i] for some array of primes and only at the end reconstruct your final result using the Chinese remainder theorem. Less will known is that this works for rationals also. (I conjectured this was true, started deriving it myself, and then learnt there are published methods.) This means we can work with exact rational arithmetic using numpy without the need for a bignum library. This code turns out being related to provenance semirings as it effectively becomes a simple database tracking the provenance of each record.

I originally wrote this code to target GPUs. On my Mac, numpy turned out to be comparable in speed to PyTorch and way faster than TensorFlow. I think this is because those libraries are optimised around data of fairly fixed shape passing through fixed pipelines whereas my code is very ad hoc. I've a feeling a few custom kernels would speed it up a lot. (I may be wrong about this but I do know I can write CUDA/Metal code directly that is many times faster than some of my dice-nine examples.)

Not just Python

And one final note. This is a deeply embedded DSL. I use Python as a host to give me an AST that I interpret. This isn't simply overloading of Python operators.

tag:blogger.com,1999:blog-11295132.post-5093748306360724234

Extensions

What does it take to be a hero? and other questions from statistical mechanics.

sigfpe Sep 15, 2024 Updated Sep 16, 2024

Show full content

1 We only hear about the survivors

In the classic Star Trek episode Errand of Mercy, Spock computes the chance of success:

CAPTAIN JAMES T. KIRK : What would you say the odds are on our getting out of here?

MR. SPOCK : Difficult to be precise, Captain. I should say, approximately 7,824.7 to 1.

And yet they get out of there. Are Spock’s probability computations unreliable? Think of it another way. The Galaxy is a large place. There must be tens of thousands of Spocks, and Grocks, and Plocks out there on various missions. But we won’t hear (or don’t want to hear) about the failures. So they may all be perfectly good at probability theory, but we’re only hearing about the lucky ones. This is an example of survivor bias.

2 Simulation

We can model this. I’ve written a small battle simulator for a super-simple made up role-playing game...

And the rest of this article can be found at github

(Be sure to download the actual PDF if you want to be able to follow links.)

tag:blogger.com,1999:blog-11295132.post-6427224712339172460

Extensions

How to hide information from yourself in a solo RPG

sigfpe Sep 7, 2024 Updated Sep 7, 2024

Show full content

A more stable version of this article can be found on github.The ProblemSince the early days of role-playing games there has been debate over which rolls the GM should make and which are the responsibility of the players. But I think that for “perception” checks it doesn’t really make sense for a player to roll. If, as a player, you roll to hear behind a door and succeed, but you’re told there is no sound, then you know there is nothing to be heard. But you ought to just be left in suspense.
If you play a solo RPG the situation is more challenging. If there is a probability p of a room being occupied, and probability q of you hearing the occupant if you listen at the door, how can you simulate listening without making a decision about whether the room is occupied before opening the door? I propose a little mathematical trick.

Helena Listening, by Arthur Rackham
Simulating conditional probabilitiesSuppose P(M) = p and P(H|M) = q (and P(H|not M) = 0). Then P(H) = pq. So to simulate the probability of hearing something at a new door: roll to see if a monster is present, and then roll to hear it. If both come up positive then you hear a noise.
But...but...you object, if the first roll came up positive you know there is a monster, removing the suspense if the second roll fails. Well this process does produce the correct (marginal) probability of hearing a noise at a fresh door. So you reinterpret the first roll not as determining whether a monster is present, but as just the first step in a two-step process to determine if a sound is heard.
But what if no sound is heard and we decide to open the door? We need to reduce the probability that we find a monster behind the door. In fact we need to sample P(M|not H). We could use Bayes’ theorem to compute this but chances are you won’t have any selection of dice that will give the correct probability. And anyway, you don’t want to be doing mathematics in the middle of a game, do you? There’s a straightforward trick. In the event that you heard no noise at the door and want to now open the door: roll (again) to see if there is a monster behind the door, and then roll to listen again. If the outcome of the two rolls matches the information that you know, ie. it predicts you hear nothing, then you can now accept the first roll as determining whether the monster is present. In that case the situation is more or less vacuously described by P(M|not H). If the two rolls disagree with what you know, ie. they predict you hear something, then repeat the roll of two dice. Keep repeating until it agrees with what you know. In generalThere is a general method here though it’s only practical for simple situations. If you need to generate some hidden variables as part of a larger procedure, just generate them as usual, keep the variables you observe, and discard the hidden part. If you ever need to generate those hidden variables again, and remain consistent with previous rolls, resimulate from the beginning, restarting the rolls if they ever disagree with your previous observations.
In principle you could even do something like simulate an entire fight against a creature whose hit points remain unknown to you. But you’ll spend a lot of time rerolling the entire fight from the beginning. So It’s better for situations that only have a small number of steps, like listening at a door.

tag:blogger.com,1999:blog-11295132.post-4591399486981749007

Extensions

What does it mean for a monad to be strong?

sigfpe Aug 21, 2023 Updated Aug 21, 2023

Show full content

This is something I put on github years ago but I probably should have put it here.

Here's an elementary example of the use of the list monad:

> test1 = do
>   x <- [1, 2]
>   y <- [x, 10*x]
>   [x*y]

We can desugar this to:

> test2 = [1, 2] >>= \x -> [x, 10*x] >>= \y -> [x*y]

It looks like we start with a list and then apply a sequence (of length 2) of functions to it using bind (>>=). This is probably why some people call monads workflows and why the comparison has been made with Unix pipes.

But looks can be deceptive. The operator (>>=) is right associative and test2 is the same as test3:

> test3 = [1, 2] >>= (\x -> [x, 10*x] >>= \y -> [x*y])

You can try to parenthesise the other way:

> -- test4 = ([1, 2] >>= \x -> [x, 10*x]) >>= \y -> [x*y]

We get a "Variable not in scope: x" error. So test1 doesn't directly fit the workflow model. When people give examples of how workflow style things can be seen as monads they sometimes use examples where later functions don't refer to variables defined earlier. For example at the link I gave above the line m >>= x-> (n >>= y-> o) is transformed to (m >>= x-> n) >>= y-> o which only works if o makes no mention of x. I found similar things to be true in a number of tutorials, especially the ones that emphasise the Kleisli category view of things.

But we can always "reassociate" to the left with a little bit of extra work. The catch is that the function above defined by y-> ... "captures" x from its environment. So it's not just one function, it's a family of functions parameterised by x. We can fix this by making the dependence on x explicit. We can then pull the inner function out as it's no longer implicitly dependent on its immediate context. When compilers do this it's called lambda lifting.

Define (the weirdly named function) strength by

> strength :: Monad m => (x, m y) -> m (x, y)
> strength (x, my) = do
>   y <- my
>   return (x, y)

It allows us to smuggle x "into the monad".

And now we can rewrite test1, parenthesising to the left:

> test5 = ([1, 2] >>= \x -> strength (x, [x, 10*x])) >>= \(x, y) -> [x*y]

This is much more like a workflow. Using strength we can rewrite any (monadic) do expression as a left-to-right workflow, with the cost of having to throw in some applications of strength to carry along all of the captured variables. It's also using a composition of arrows in the Kleisli category.

A monad with a strength function is called a strong monad. Clearly all Haskell monads are strong as I wrote strength to work with any Haskell monad. But not all monads in category theory are strong. It's a sort of hidden feature of Haskell (and the category Set) that we tend not to refer to explicitly. It could be said that we're implicitly using strength whenever we refer to earlier variables in our do expressions.

See also nlab.

> main = do
>   print test1
>   print test2
>   print test3
>   -- print test4
>   print test5

tag:blogger.com,1999:blog-11295132.post-3424295895362574303

Extensions

Constructing Clifford Algebras using the Super Tensor Product

sigfpe Mar 5, 2023 Updated Apr 23, 2026

Show full content

Google have stopped supporting the Chart API so all of the mathematics notation below is missing. There is a PDF version of this article at GitHub.

Some literate Haskell but little about this code is specific to Haskell...

> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE UndecidableInstances #-}
> 
> import GHC.TypeLits

Introduction

This is a followup to Geometric Algebra for Free and More Low Cost Geometric Algebra.

In those articles I showed how you could build up the Clifford algebras like so:

type Cliff1  = Complex R
type Cliff1' = Split R
type Cliff2  = Quaternion R
type Cliff2' = Matrix R
type Cliff3  = Quaternion Cliff1'
type Cliff3' = Matrix Cliff1
type Cliff4  = Quaternion Cliff2'
type Cliff4' = Matrix Cliff2
type Cliff5  = Quaternion Cliff3'
...

I used CliffN as the Clifford algebra for a negative definite inner product and CliffN' for the positive definite case. It's not a completely uniform sequence in the sense that CliffN is built from CliffN' for dimension two lower and you use a mix of Matrix and Quaternion.

The core principle making this work is that for type constructors $T$ implemented like Matrix, Quaternion etc. we have the property that

$TU\mathbb{R} = T\mathbb{R}\otimes U\mathbb{R}$

eg. Matrix (Quaternion Float) is effectively the same thing as Matrix Float $\otimes$ Quaternion Float.

But John Baez pointed out to me that you can build up the CliffN algebras much more simply enabling us to use these definitions:

> type Cliff1 = Complex Float
> type Cliff2 = Complex Cliff1
> type Cliff3 = Complex Cliff2
> type Cliff4 = Complex Cliff3
> type Cliff5 = Complex Cliff4

...

Or even better:

> type family Cliff (n :: Nat) :: * where
>   Cliff 0 = Float
>   Cliff n = Complex (Cliff (n - 1))

But there's one little catch. We have to work, not with the tensor product, but the super tensor product.

We define Complex the same way as before:

> data Complex a = C a a deriving (Eq, Show)

Previously we used a definition of multiplication like this:

instance Num a => Num (Complex a) where
  C a b * C c d = C (a * c - b * d) (a * d + b * c)

We can think of C a b in Complex R as representing the element $1\otimes a+i\otimes b$. The definition of multiplication in a tensor product of algebras is \[(a\otimes b)(c\otimes d)=(ac)\otimes(bd).\] So we have \[(1\otimes a+i\otimes b)(1\otimes c+i\otimes d)\] \[=1\otimes ac+i\otimes ad+i\otimes bc+i^2\otimes bd\] \[=1\otimes(ac-bd)+i\otimes(ad+bc).\]

This means that line of code we wrote above defining * for Complex isn't simply a definition of multiplication of complex numbers, it says how to multiply in an algebra tensored with the complex numbers.

Let's go Super!

A superalgebra is an algebra graded by $\mathbb{Z}_2$ where $\mathbb{Z}_2$ is the ring of integers modulo 2. What that means is that we have some algebra $A$ that can be broken down as a direct sum $A_0\oplus A_1$ (the subscripts live in $\mathbb{Z}_2$ ) with the property that multiplication respects the grading, ie. if $x$ is in $A_i$ and $y$ is in $A_j$ then $xy$ is in $A_{i+j}$ .

The elements of $A_0$ are called "even" (or bosonic) and those in $A_1$ "odd" (or fermionic). Often even elements commute with everything and odd elements anticommute with each other but this isn't always the case. (The superalgebra is said to be supercommutative when this happens. This is a common pattern: a thing X becomes a superX if it has odd and even parts and swapping two odd things introduces a sign flip.)

The super tensor product is much like the tensor product but it respects the grading. This means that if $x$ is in $A_i$ and $y$ is in $B_j$ then $x\otimes y$ is in $(A\otimes B)_{i+j}$ . From now on I'm using $\otimes$ to mean super tensor product.

Multiplication in the super tensor product of two superalgebras $A$ and $B$ is now defined by the following modified rule: if $b$ is in $B_i$ and $c$ is in $A_j$ then $(a\otimes b)(c\otimes d) = (-1)^{ij}(ac)\otimes(bd)$ . Note that the sign flip arises when we shuffle an odd $c$ left past an odd $b$ .

The neat fact that John pointed out to me is that \[Cliff_n=\mathbb{C}\otimes\mathbb{C}\otimes\ldots\text{ n times }\ldots\otimes\mathbb{C}.\]

We have to modify our definition of * to take into account that sign flip.

I initially wrote a whole lot of code to define a superalgebra as a pair of algebras with four multiplication operations and it got a bit messy. But I noticed that the only specifically superalgebraic operation I ever performed on an element of a superalgebra was negating the odd part of an element.

So I could define SuperAlgebra like so:

class SuperAlgebra a where
  conjugation :: a -> a

where conjugation is the negation of the odd part.

(I'm not sure if this operation corresponds to what is usually called conjugation in this branch of mathematics.)

But there's a little efficiency optimization I want to write. If I used the above definition, then later I'd often find myself computing a whole lot of negates in a row. This means applying negate to many elements of large algebraic objects even though any pair of them cancel each other's effect. So I add a little flag to my conjugation function that is used to say we want an extra negate and we can accumulate flips of a flag rather than flips of lots of elements.

> class SuperAlgebra a where
>   conjugation :: Bool -> a -> a

Here's our first instance:

> instance SuperAlgebra Float where
>   conjugation False x = x
>   conjugation True x = negate x

This is saying that the conjugation is the identity on Float but if we want to perform an extra flip we can set the flag to True. Maybe I should call it conjugationWithOptionalExtraNegation.

And now comes the first bit of non-trivial superalgebra:

> instance (Num a, SuperAlgebra a) => SuperAlgebra (Complex a) where
>   conjugation e (C a b) = C (conjugation e a) (conjugation (not e) b)

We consider $1$ to be even and $i$ to be odd. When we apply the conjugation to $1\otimes a+i\otimes b$ then we can just apply it directly to $a$ . But that $i\otimes$ flips the "parity" of $b$ (because tensor product respects the grading) so we need to swap when we use the conjugation. And that should explain why conjugation is defined the way it is.

Now we can use the modified rule for $\mathbb{C}\otimes$ defined above:

> instance (Num a, SuperAlgebra a) => Num (Complex a) where
>   fromInteger n = C (fromInteger n) 0
>   C a b + C a' b' = C (a + a') (b + b')
>   C a b * C c d = C (a * c - conjugation False b * d)
>                     (conjugation False a * d + b * c) 
>   negate (C a b) = C (negate a) (negate b)
>   abs = undefined
>   signum = undefined

For example, conjugation False is applied to the first $b$ on the RHS because $d$ implicitly represents an $id$ term and when expanding out the product we shuffle the (odd) $i$ in $id$ left of $b$ . It doesn't get applied to the second $ib$ because $ib$ and $c$ remain in the same order.

That's it!

Tests

I'll test it with some examples from Cliff3:

> class HasBasis a where
>   e :: Integer -> a


> instance HasBasis Float where
>   e = undefined


> instance (Num a, HasBasis a) => HasBasis (Complex a) where
>   e 0 = C 0 1
>   e n = C (e (n - 1)) 0


> make a b c d e f g h =
>   C (C (C a b) (C c d))
>     (C (C e f) (C g h))


> e1, e2, e3, e21, e31, e32, e321 :: Cliff 3
> e1 = e 0
> e2 = e 1
> e21 = e2 * e1
> e3 = e 2
> e31 = e3 * e1
> e32 = e3 * e2
> e321 = e3 * e2 * e1


> main = do
>     print (e1 * e1 + 1 == 0)
>     print (e31 * e31 + 1 == 0)
>     print (e3 * e3 + 1 == 0)
>     print (e21 * e21 + 1 == 0)
>     print (e2 * e2 + 1 == 0)
>     print (e32 * e32 + 1 == 0)
>     print (e321 * e321 - 1 == 0)
>     print (e3 * e2 * e1 - e321 == 0)
>     print (e2 * e1 - e21 == 0)
>     print (e3 * e1 - e31 == 0)
>     print (e3 * e2 - e32 == 0)
>     print (e21 * e32 - e31 == 0)

Observation

The implementation of multiplication looks remarkably like it's the Cayley-Dickson construction. It can't be (because iterating it three times gives you a non-associative algebra but the Clifford algebras are associative). Nonetheless, I think comparison with Cayley-Dickson may be useful.

Efficiency

As mentioned above, before I realised I just needed the conjugation operation I wrote the above code with an explicit split of a superalgebra into two pieces intertwined by four multiplications. I think the previous approach may have a big advantage - it may be possible to use variations on the well known "speed-up" of complex multiplication that uses three real multiplications instead of four. This should lead to a fast implementation of Clifford algebras.

Also be warned: you can kill GHC if you turn on optimization and try to multiply elements of high-dimensional Clifford algebras. I think it tries to inline absolutely everything and you end up with a block of code that grows exponentially with $n$ .

Note also that this code translates directly into many languages.

tag:blogger.com,1999:blog-11295132.post-4271990394665143853

Extensions

Some pointers to things not in this blog

sigfpe Sep 5, 2020 Updated Sep 5, 2020

Show full content

Some pointers to things not in this blog

One reason I haven't blogged much recently is that my tolerance for blogger.com has reached its limit and I've been too lazy to build my own platform supporting mathematics and code. (For example, I can't get previewing on blogger to work today so I'm just publishing this and hope the reformatting is acceptable.) But that doesn't mean I haven't posted stuff publicly. So here are some thematically related links to things I've written on github and colab.

Continuations, effects and runners

How to slice your code into continuations

Handling Effects with Jax

Are these runners?
Just a little snippet of code to illustrate how Python's coroutines can be used to support composable runners. See http://math.andrej.com/2019/10/28/runners-in-action/
(The answer is yes.)

Parallel audioParallel audio synthesis
Simulating Plucked Strings with TensorFlow

FWIW I think Colab might be my favourite place to share stuff publicly if it supported environments other than Python.

tag:blogger.com,1999:blog-11295132.post-2710424073777324923

Extensions

Why is nuclear fusion so hard?

sigfpe Dec 8, 2018 Updated Dec 8, 2018

Show full content

Why does water fall out of an inverted cup? Before considering nuclear fusion, let's consider something much more familiar. If you turn a cup full of water upside down, the water falls out. Why is this? It seems obvious: with nothing supporting the water from below, gravity pulls the water out of the cup. But let's look a little closely at the left side of Figure 1. We have water in an inverted cup under the influence of gravity. But there is air pressure at the surface of the water. At sea level, atmospheric pressure is about the weight of 1kg per cm2, easily enough to keep the water in the cup. So now we still have our question: why does it come out?

Consider the diagram on the right of Figure 1. I have drawn some ripples on the surface of the water. Air pressure provides a force perpendicular to the water surface which means that around the ripples we no longer have a force pointing straight up. The force points partly sideways and this serves to deform the shape of the water surface. But as the water surface becomes even more deformed the forces become even more distorted away from vertical causing a feedback loop. So we can expect even the tiniest of ripples to grow to the point where the water completely changes shape and it eventually deforms its way out of the glass.

Figure 1

This is an example of an instability - in this case it's called the Rayleigh-Taylor instability. And this suggests an approach to keeping water in an inverted cup - suppress the ripples. One way to do this is to fill the cup to the top and place a piece of paper over the surface. Here is an example on YouTube:

Another approach is to apply thickening agents to the water to dampen the ripples. Here's an example showing how hard it is to get jello out of a bowl:
Nuclear fusion And now I can discuss one of the challenges facing controlled nuclear fusion. When two suitable atomic nuclei are made to fuse they release energy. But nuclei are positively charged and repel. So in order to cause nuclei to fuse we need to overcome this repulsion, typically by making them collide at very high velocities. To achieve this in useful numbers one way is to heat a suitable fuel to a high temperature in a small enough volume. Temperatures of at least 10,000,000K are required, and possibly orders of magnitude higher. We know how to do the heating, but we also have to confine the fuel - which at these temperatures takes the form of a plasma because its nuclei and electrons become separated. We can't simply use a solid container to hold the plasma as no container can survive such temperatures. Instead we need some other confinement method such as magnetic confinement, where we use magnetic fields to control the plasma, or inertial confinement where we use explosions around the fuel to drive the fuel into a small volume. Both of these suffer from the same problem: we're using a force to control a fluid and this is subject to instabilities. Confining a plasma is like trying to keep water in an inverted cup by blowing on the water with fans. It's really hard. What's more, plasma has incredibly complex dynamics that is subject to many different kinds of instability. Around fifty named instabilities are listed on Wikipedia. Here is an example of one in a real plasma. It is a kink instability that causes a line of plasma with a current through it to start acquiring a corkscrew shape so that it eventually tears itself apart:

And that's one of the biggest challenges facing fusion energy today: we don't know how to keep a plasma stable for long enough. Is there any hope for fusion? The description above is qualitative. To determine how severe any particular instability is we need to run experiments or use some theory. Some kinds of instability are slow enough that we can control them. For example bicycles are unstable, but many of us eventually learn to ride them. In a thermonuclear weapon stability is only required for a short amount of time, something that was achievable back in the 50s. And sometimes the feedback loops in the physics are less troublesome than might be expected so that Tokamaks (and other configurations) have operating modes that turn out to be relatively stable, for example the "H-mode". So maybe we'll have working fusion reactors in 20 years, or 30 years, or is it 50?

tag:blogger.com,1999:blog-11295132.post-7451252178140834190

Extensions

Running from the past

sigfpe Oct 20, 2018 Updated Apr 29, 2025

Show full content

Important Note

Google have stopped supporting the Chart API so all of the mathematics notation below is missing. There is a PDF version of this article at GitHub.

Preface

Functional programming encourages us to program without mutable state. Instead we compose functions that can be viewed as state transformers. It's a change of perspective that can have a big impact on how we reason about our code. But it's also a change of perspective that can be useful in mathematics and I'd like to give an example: a really beautiful technique that alows you to sample from the infinite limit of a probability distribution without needing an infinite number of operations. (Unless you're infinitely unlucky!)

Markov Chains

A Markov chain is a sequence of random states where each state is drawn from a random distribution that possibly depends on the previous state, but not on any earlier state. So it is a sequence $X_0, X_1, X_2, \ldots$ such that $P(X_{i+1}=x|X_0,X_1,\ldots,X_i) = P(X_{i+1}=x|X_i)$ for all $i\ge0$ . A basic example might be a model of the weather in which each day is either sunny or rainy but where it's more likely to be rainy (or sunny) if the previous day was rainy (or sunny). (And to be technically correct: having information about two days or earlier doesn't help us if we know yesterday's weather.)

Like imperative code, this description is stateful. The state at step $i+1$ depends on the state at step $i$ . Probability is often easier to reason about when we work with independent identically drawn random variables and our $X_i$ aren't of this type. But we can eliminate the state from our description using the same method used by functional programmers.

Let's choose a Markov chain to play with. I'll pick one with 3 states called $A$ , $B$ and $C$ and with transition probabilities given by $P(X_{i+1}=y|X_i=x)=T_{xy}$ where $T=\begin{pmatrix} \frac{1}{2}& \frac{1}{2}& 0\\ \frac{1}{3}& \frac{1}{3}& \frac{1}{3}\\ 0& \frac{1}{2}& \frac{1}{2}\\ \end{pmatrix}$

Here's a diagram illustrating our states:

Implementation

First some imports:

> {-# LANGUAGE LambdaCase #-}
> {-# LANGUAGE TypeApplications #-}


> import Data.Sequence(replicateA)
> import System.Random
> import Control.Monad.State
> import Control.Monad
> import Data.List
> import Data.Array

And now the type of our random variable:

> data ABC = A | B | C deriving (Eq, Show, Ord, Enum, Bounded)

We are now in a position to simulate our Markov chain. First we need some random numbers drawn uniformly from [0, 1]:

> uniform :: (RandomGen gen, MonadState gen m) => m Double
> uniform = state random

And now the code to take a single step in the Markov chain:

> step :: (RandomGen gen, MonadState gen m) => ABC -> m ABC
> step A = do
>     a <- uniform
>     if a < 0.5
>         then return A
>         else return B
> step B = do
>     a <- uniform
>     if a < 1/3.0
>         then return A
>         else if a < 2/3.0
>             then return B
>             else return C
> step C = do
>     a <- uniform
>     if a < 0.5
>         then return B
>         else return C

Notice how the step function generates a new state at random in a way that depends on the previous state. The m ABC in the type signature makes it clear that we are generating random states at each step.

We can simulate the effect of taking $n$ steps with a function like this:

> steps :: (RandomGen gen, MonadState gen m) => Int -> ABC -> m ABC
> steps 0 i = return i
> steps n i = do
>     i <- steps (n-1) i
>     step i

We can run for 100 steps, starting with $A$ , with a line like so:

*Main> evalState (steps 3 A) gen
B

The starting state of our random number generator is given by gen.

Consider the distribution of states after taking $n$ steps. For Markov chains of this type, we know that as $n$ goes to infinity the distribution of the $n$ th state approaches a limiting "stationary" distribution. There are frequently times when we want to sample from this final distribution. For a Markov chain as simple as this example, you can solve exactly to find the limiting distribution. But for real world problems this can be intractable. Instead, a popular solution is to pick a large $n$ and hope it's large enough. As $n$ gets larger the distribution gets closer to the limiting distribution. And that's the problem I want to solve here - sampling from the limit. It turns out that by thinking about random functions instead of random states we can actually sample from the limiting distribution exactly.

Some random functions

Here is a new version of our random step function:

> step' :: (RandomGen gen, MonadState gen m) => m (ABC -> ABC)
> step' = do
>     a <- uniform
>     return $ \case
>         A -> if a < 0.5 then A else B
>         B -> if a < 1/3.0
>                 then A
>                 else if a < 2/3.0 then B else C
>         C -> if a < 0.5 then B else C

In many ways it's similar to the previous one. But there's one very big difference: the type signature m (ABC -> ABC) tells us that it's returning a random function, not a random state. We can simulate the result of taking 10 steps, say, by drawing 10 random functions, composing them, and applying the result to our initial state:

> steps' :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)
> steps' n = do
>   fs <- replicateA n step'
>   return $ foldr (flip (.)) id fs

Notice the use of flip. We want to compose functions $f_{n-1}\circ f_{n-2}\circ\ldots\circ f_0$ , each time composing on the left by the new $f$ . This means that for a fixed seed gen, each time you increase $n$ by 1 you get the next step in a single simulation: (BTW I used replicateA instead of replicateM to indicate that these are independent random draws. It may be well known that you can use Applicative instead of Monad to indicate independence but I haven't seen it written down.)

*Main> [f A | n <- [0..10], let f = evalState (steps' n) gen]
[A,A,A,B,C,B,A,B,A,B,C]

When I first implemented this I accidentally forgot the flip. So maybe you're wondering what effect removing the flip has? The effect is about as close to a miracle as I've seen in mathematics. It allows us to sample from the limiting distribution in a finite number of steps!

Here's the code:

> steps_from_past :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)
> steps_from_past n = do
>   fs <- replicateA n step'
>   return $ foldr (.) id fs

We end up building $f_{0}$ $\circ$ $f_{1}$ $\ldots$ $\circ$ $f_{n-1}$ . This is still a composition of $n$ independent identically distributed functions and so it's still drawing from exactly the same distribution as steps'. Nonetheless, there is a difference: for a particular choice of seed, steps_from_past n no longer gives us a sequence of states from a Markov chain. Running with argument $n$ draws a random composition of $n$ functions. But if you increase $n$ by 1 you don't add a new step at the end. Instead you effectively restart the Markov chain with a new first step generated by a new random seed.

Try it and see:

*Main> [f A | n <- [0..10], let f = evalState (steps_from_past n) gen]
[A, A, A, A, A, A, A, A, A, A]

Maybe that's surprising. It seems to get stuck in one state. In fact, we can try applying the resulting function to all three states.

*Main> [fmap f [A, B, C] | n <- [0..10], let f = evalState (steps_from_past n) gen]
[[A,B,C],[A,A,B],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A]]

In other words, for $n$ large enough we get the constant function.

Think of it this way: If f isn't injective then it's possible that two states get collapsed to the same state. If you keep picking random f's it's inevitable that you will eventually collapse down to the point where all arguments get mapped to the same state. Once this happens, we'll get the same result no matter how large we take $n$ . If we can detect this then we've found the limit of $f_{0}$ $\circ$ $f_{1}$ $\ldots$ $\circ$ $f_{n-1}$ as $n$ goes to infinity. But because we know composing forwards and composing backwards lead to draws from the same distribution, the limiting backward composition must actually be a draw from the same distribution as the limiting forward composition. That flip can't change what probability distribution we're drawing from - just the dependence on the seed. So the value the constant function takes is actually a draw from the limiting stationary distribution.

We can code this up:

> all_equal :: (Eq a) => [a] -> Bool
> all_equal [] = True
> all_equal [_] = True
> all_equal (a : as) = all (== a) as


> test_constant :: (Bounded a, Enum a, Eq a) => (a -> a) -> Bool
> test_constant f =
>     all_equal $ map f $ enumFromTo minBound maxBound

This technique is called coupling from the past. It's "coupling" because we've arranged that different starting points coalesce. And it's "from the past" because we're essentially asking answering the question of what the outcome of a simulation would be if we started infinitely far in the past.

> couple_from_past :: (RandomGen gen, MonadState gen m, Enum a, Bounded a, Eq a) =>
>                   m (a -> a) -> (a -> a) -> m (a -> a)
> couple_from_past step f = do
>     if test_constant f
>         then return f
>         else do
>             f' <- step
>             couple_from_past step (f . f')

We can now sample from the limiting distribution a million times, say:

*Main> let samples = map ($ A) $ evalState (replicateA 1000000 (couple_from_past step' id)) gen

We can now count how often A appears:

*Main> fromIntegral (length $ filter (== A) samples)/1000000
0.285748

That's a pretty good approximation to $\frac{2}{7}$ , the exact answer that can be found by finding the eigenvector of the transition matrix corresponding to an eigenvalue of 1.

> gen = mkStdGen 669

Notes

The technique of coupling from the past first appeared in a paper by Propp and Wilson. The paper Iterated Random Functions by Persi Diaconis gave me a lot of insight into it. Note that the code above is absolutely not how you'd implement this for real. I wrote the code that way so that I could switch algorithm with the simple removal of a flip. In fact, with some clever tricks you can make this method work with state spaces so large that you couldn't possibly hope to enumerate all starting states to detect if convergence has occurred. Or even with uncountably large state spaces. But I'll let you read the Propp-Wilson paper to find out how.

tag:blogger.com,1999:blog-11295132.post-3498877547203793692

Extensions

A tail we don't need to wag

sigfpe Oct 14, 2017 Updated Oct 14, 2017

Show full content

Introduction

I've been reading a little about concentration inequalities recently. I thought it would be nice to see if you can use the key idea, if not the actual theorems, to reduce the complexity of computing the probability distribution of the outcome of stochastic simulations. Examples might include random walks, or queues.

The key idea behind concentration inequalities is that very often most of the probability is owned by a small proportion of the possible outcomes. For example, if we toss a fair coin enough (say $N$ ) times we expect the number of heads to lie within $2\sqrt{N}$ of the mean about 99.99% of the time despite there being $N+1$ different total numbers possible. The probable outcomes tend to concentrate around the expectation. On the other hand, if we consider not the total number of heads, but the possible sequences of $N$ tosses, there are $2^N$ possibilities, all equally likely. In this case there is no concentration. So a key ingredient here is a reduction operation: in this case reducing a sequence of tosses to a count of the number that came up heads. This is something we can use in a computer program.

I (and many others) have written about the "vector space" monad that can be used to compute probability distributions of outcomes of simulations and I'll assume some familiarity with that. Essentially it is a "weighted list" monad which is similar to the list monad except that in addition to tracking all possible outcomes, it also propagates a probability along each path. Unfortunately it needs to follow through every possible path through a simulation. For example, in the case of simulating $N$ coin tosses it needs to track $2^N$ different possiblities, even though we're only interested in the $N+1$ possible sums. If, after each bind operation of the monad, we could collect together all paths that give the same total then we could make this code much more efficient. The catch is that to collect together elements of a type the elements need to be comparable, for example instances of Eq or Ord. This conflicts with the type of Monad which requires that we can use the >>= :: m a -> (a -> m b) -> m b and return :: a -> m a functions with any types a and b.

I'm going to deal with this by adapting a technique presented by Oleg Kiselyov for efficiently implementing the Set monad. Instead of Set I'm going to use the Map type to represent probability distributions. These will store maps saying, for each element of a type, what the probability of that element is. So part of my code is going to be a direct translation of that code to use the Map type instead of the Set type.

> {-# LANGUAGE GADTs, FlexibleInstances #-}
> {-# LANGUAGE ViewPatterns #-}


> module Main where


> import Control.Monad
> import Control.Arrow
> import qualified Data.Map as M
> import qualified Data.List as L

The following code is very similar to Oleg's. But for first reading I should point out some differences that I want you to ignore. The type representing a probability distribution is P:

> data P p a where
>     POrd :: Ord a => p -> M.Map a p -> P p a
>     PAny :: p -> [(a, p)] -> P p a

But note how the constructors take two arguments - a number that is a probability, in addition to a weighted Map or list. For now pretend that first argument is zero and that the functions called trimXXX act similarly to the identity:

> instance (Ord p, Num p) => Functor (P p) where
>     fmap = liftM


> instance (Ord p, Num p) => Applicative (P p) where
>     pure = return
>     (<*>) = ap


> instance (Ord p, Num p) => Monad (P p) where
>     return x = PAny 0 [(x, 1)]
>     m >>= f = 
>         let (e, pdf) = unP m
>         in trimAdd e $ collect $ map (f *** id) pdf


> returnP :: (Ord p, Num p, Ord a) => a -> P p a
> returnP a = POrd 0 $ M.singleton a 1


> unP :: P p a -> (p, [(a, p)])
> unP (POrd e pdf) = (e, M.toList pdf)
> unP (PAny e pdf) = (e, pdf)


> fromList :: (Num p, Ord a) => [(a, p)] -> M.Map a p
> fromList = M.fromListWith (+)


> union :: (Num p, Ord a) => M.Map a p -> M.Map a p -> M.Map a p
> union = M.unionWith (+)


> scaleList :: Num p => p -> [(a, p)] -> [(a, p)]
> scaleList weight = map (id *** (weight *))


> scaleMap :: (Num p, Ord a) => p -> M.Map a p -> M.Map a p
> scaleMap weight = fromList . scaleList weight . M.toList

This is a translation of Oleg's crucial function that allows us to take a weighted list of probability distributions and flatten them down to a single probability distribution:

> collect :: Num p => [(P p a, p)] -> P p a
> collect []  = PAny 0 []
> collect ((POrd e0 pdf0, weight) : rest) =
>     let wpdf0 = scaleMap weight pdf0
>     in case collect rest of
>       POrd e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` pdf1
>       PAny e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` fromList pdf1
> collect ((PAny e0 pdf0, weight) : rest) =
>     let wpdf0 = scaleList weight pdf0
>     in case collect rest of
>       POrd e1 pdf1 -> POrd (weight*e0+e1) $ fromList wpdf0 `union` pdf1
>       PAny e1 pdf1 -> PAny (weight*e0+e1) $ wpdf0 ++ pdf1

But now I really must explain what the first argument to POrd and PAny is and why I have all that "trimming".

Even though the collect function allows us to reduce the number of elements in our PDFs, we'd like to take advantage of concentration of probability to reduce the number even further. The trim function keeps only the top $N$ probabilities in a PDF, discarding the rest. To be honest, this is the only point worth taking away from what I've written here :-)

When we throw away elements of the PDF our probabilities no longer sum to 1. So I use the first argument of the constructors as a convenient place to store the amount of probability that I've thrown away. The trim function keeps the most likely $N$ outcomes and sums the probability of the remainder. I don't actually need to keep track of what has been discarded. In principle we could reconstruct this value by looking at how much the probabilities in our trimmed partial PDFs fall short of summing to 1. But confirming that our discarded probability and our partial PDF sums to 1 gives a nice safety check for our code and can give us some warning if numerical errors start creeping in. I'll call the total discarded probability the tail probability.

Here is the core function to keep the top $N$ values. In this case $N$ is given by a global constant called trimSize. (I'll talk about how to do this better later.)

> trimList :: (Ord p, Num p) => [(a, p)] -> (p, [(a, p)])
> trimList ps =
>     let (keep, discard) = L.splitAt trimSize (L.sortOn (negate . snd) ps)
>     in (sum (map snd discard), keep)


> trimAdd :: (Ord p, Num p) => p -> P p a -> P p a
> trimAdd e' (POrd e pdf) =
>     let (f, trimmedPdf) = trimList (M.toList pdf)
>     in POrd (e'+e+f) (M.fromList trimmedPdf)
> trimAdd e' (PAny e pdf) =
>     let (f, trimmedPdf) = trimList pdf
>     in PAny (e'+e+f) trimmedPdf


> runP :: (Num p, Ord a) => P p a -> (p, M.Map a p)
> runP (POrd e pdf) = (e, pdf)
> runP (PAny e pdf) = (e, fromList pdf)

And now some functions representing textbook probability distributions. First the uniform distribution on a finite set. Again this is very similar to Oleg's chooseOrd function apart from the fact that it assigns weights to each element:

> chooseP :: (Fractional p, Ord p, Ord a) =>
>            [a] -> P p a
> chooseP xs = let p = 1/fromIntegral (length xs)
>              in POrd 0 $ fromList $ map (flip (,) p) xs

And the Bernoulli distribution, i.e. tossing a Bool coin that comes up True with probability $p$ :

> bernoulliP :: (Fractional p, Ord p) =>
>               p -> P p Bool
> bernoulliP p = POrd 0 $ fromList $ [(False, 1-p), (True, p)]

Now we can try a random walk in one dimension. At each step we have a 50/50 chance of standing still or taking a step to the right:

> random_walk1 :: Int -> P Double Int
> random_walk1 0 = returnP 0
> random_walk1 n = do
>     a <- random_walk1 (n-1)
>     b <- chooseP [0, 1]
>     returnP $ a+b

Below in main we take 2048 steps but only track 512 probabilities. The tail probability in this case is about $10^{-29}$ . So only tracking 1/4 of the outcomes has had almost no impact on the numbers. This also illustrates why it is good to track the tail probabilities rather than inferring them from the missing probabilities in the bulk of the PDF - they can be so small they vanish compared to floating poimnt errors. We can afford to track a lot fewer than 512 (out of 2049 possible) outcomes and still have a good representative PDF.

Now here's a two-dimensional random walk for 32 steps. The tail probability is about $2\times 10^{-5}$ so we are getting a reasonably representative PDF. We have to run fewer steps than before, however, because the space of possible outcomes spans two dimensions, meaning that reduction doesn't help as much as it does in one dimension.

> random_walk2 :: Int -> (Int, Int) -> P Double (Int, Int)
> random_walk2 0 (x, y) = returnP (x, y)
> random_walk2 n (x, y) = do
>     (x',y') <- random_walk2 (n-1) (x, y)
>     dx <- chooseP [-1, 1]
>     dy <- chooseP [-1, 1]
>     returnP (x'+dx, y'+dy)

One last simulation. This is a queing scenario. Tasks come in once every tick of the clock. There are four queues a task can be assigned to. A task is assigned to the shortest queue. Meanwhile each queue as a 1/4 probability of clearing one item at each tick of the clock. We build the PDF for the maximum length any queue has at any time.

The first argument to queue is the number of ticks of the clock. The second argument is the list of lengths of the queues. It returns a PDF, not just on the current queue size, but also on the longest queue it has seen.

> queue :: Int -> [Int] -> P Double (Int, [Int])
> queue 0 ls = returnP (maximum ls, ls)
> queue n ls = do
>     (longest, ls1) <- queue (n-1) ls
>     ls2 <- forM ls1 $ \l -> do
>         served <- bernoulliP (1/4)
>         returnP $ if served && l > 0 then l-1 else l
>     let ls3 = L.sort $ head ls2+1 : tail ls2
>     returnP (longest `max` maximum ls3, ls3)

For the queing simulation the tail probability is around $5\times 10^{-4}$ despite the fact that we have discarded a vast possible set of possible outcomes.

It's a little ugly that trimSize is a global constant:

> trimSize = 512

The correct solution is probably to separate the probability "syntax" from its "semantics". In other words, we should implement a free monad supporting the language of probability with suitable constructors for bernoulliP and choiceP. We can then write a separate interpreter which takes a trimSize as argument. This has another advantage too: the Monad above isn't a true monad. It uses a greedy approach to discarding probabilities and different rearrangements of the code, that ought to give identical results, may end up diferent. By using a free monad we ensure that our interface is a true monad and we can put the part of the code that breaks the monad laws into the interpreter. The catch is that my first attempt at writing a free monad resulted in code with poor performance. So I'll leave an efficient version as an exercise :-)

> main = do
>     print $ runP $ random_walk1 2048
>     print $ runP $ random_walk2 32 (0, 0)
>     print $ runP $ do
>         (r, _) <- queue 128 [0, 0, 0, 0]
>         returnP r

tag:blogger.com,1999:blog-11295132.post-6611224798233016626

Extensions

What is a photon?

sigfpe Aug 12, 2017 Updated Aug 12, 2017

Show full content

Introduction

Popular science writing about quantum mechanics leaves many people full of questions about the status of photons. I want to answer some of these without using any tricky mathematics.

One of the challenges is that photons are very different to ordinary everyday objects like billiard balls. This is partly because photons are described by quantum mechanics whereas billiard balls are better modelled with classical Newtonian mechanics. Quantum mechanics defies many of our intuitions. But it's also because the word photon plays by different linguistic rules to billiard ball. I hope to explain why.

One of my goals is to avoid saying anything original. I'm largely going remove the mathematics from material I first learnt from three or so courses I took at Cambridge University many years ago: Quantum Mechanics, Solid State Physics and Quantum Field Theory. I also learnt about some of this from David Miller at Stanford University who talked a little about what properties it is meaningful to apply to a photon. (I hope I haven't misrepresented him too badly.)

The simple harmonic oscillator

Here's a mass hanging on a spring:

Suppose it's initially sitting in equilibrium so that the net force acting on it is zero. Now we lift the mass a small distance and let it go. Because we lifted it, we shortened the spring, reducing its tension. This means the force due to gravity is now more than the spring tension and the mass falls. Eventually it falls below the equilibrium point, increasing the tension in the spring so there is a net force pulling it back up again. To a good approximation, the force restoring the mass to its equilibrium point is proportional to how far it has been displaced. When this happens we end up with oscillating motion where the mass bounces up and down. Here's what a graph of its displacement looks like over time:

It's actually a sine wave but that detail doesn't matter for us right now.

An oscillator where the restoring force is proportional to the displacement from the equilibrium point is called a simple harmonic oscillator and its oscillation is always described by a sine wave.

Note that I'm ignoring friction here. This is a reasonable approximation for many physical systems.

Masses on springs aren't all that important in themselves. But simple harmonic oscillators are very common. Another standard example is the pendulum swinging under the influence of gravity:

At a more fundamental level, an example might be an atom in a crystal being held in place by electrostatic forces from its neighbouring atoms.

If you have one of these systems, then in principle you can set it in motion with as little energy as you like. Pull a mass on a spring down a little bit and it will bounce back up, oscillating a certain amount. Pull the mass down half the amount and it'll bounce with oscillations half the size. In principle we could keep repeating this experiment, each time starting with the mass displaced half the amount we tried previously. In other words, a simple harmonic oscillator can have any energy we like. The spectrum of possible energies of one of these oscillators is continuous. (Note that the word spectrum here is merely physicist-speak for a set of possible values.) If we can set one in motion with 1 unit of energy then we can also set it oscillating with 0.5 units, or 0.01 units, or 0.000123 units of energy.

Quantum mechanics

Everything I've said above is assuming that classical Newtonian mechanics is valid. But we know that for very small systems, around the size of a few atoms or smaller, we need to use quantum mechanics. This is an enormous topic but I'm only going to extract one basic fact. According to quantum mechanics, a simple harmonic oscillator isn't free to oscillate with any energy you like. The possible energy levels, the spectrum of the system, is discrete. There is a lowest energy level, and then all of the energy levels above that are equally spaced like so, going up forever:

We usually call the lowest energy level the ground state or vacuum state and call the higher levels excited states.

The spacing of the energy levels depends on the stiffness of the system, which is just a measure of how much the restoring force increases with displacement from equilibrium. Stiffer systems will have a higher frequency of oscillation and a bigger spacing between the energy levels.

(I'm deliberately not saying anything about why we get discrete energy levels in quantum mechanics. I just want to use this one fact so I can get on and talk about photons eventually.)

In practice the difference in energy between one level and the next is tiny. This means that if you're literally fiddling about with a mass on a spring you won't ever feel the discreteness. The amount your hand trembles is many orders of magnitude greater than the effect of this discreteness. Nonetheless, it is extremely important when modeling microscopic systems.

Quantum linguistics

Here are some English sentences we could say about the kinds of systems I've described so far:

This system is in the ground state.
That system is in its first excited state
This system is at an energy level higher than that system
After allowing these two similar oscillators to interact, the energy level of this oscillator went down and the energy level of that one went up by the same amount.

Now I want to introduce the (count) noun quantum, with plural quanta. The idea here is not that I'm telling you about a new entity. I want to present this as a new way to talk about things I've already introduced. So rather than give a definition of quantum I will instead show how you can rewrite the above sentences using the language of quanta:

There are no quanta in this system
That system has one quantum of energy
This system has more quanta than that one
Some quanta were transferred from this system to that system.

Those sentences make it seem like I'm talking about a new kind of object - the quantum. But I'm not. They're just a manner of speaking about energy levels. I hope I've given you enough examples to get the idea.

Just in case you think it's weird to talk about energy levels in terms of quanta, I'd like to remind you that you already do this all the time with money. Dollar bills are actual objects that exist in the world. But money in your bank account isn't. Somewhere in some database is a representation of how much money you have. You might say "I have one hundred dollars in my savings account" But those dollars certainly don't exist as distinct entities. It doesn't really make sense to talk about the thirty-seventh dollar in your bank account. You can transfer dollars from one account to another, and yet what's really happening is that two totals are being adjusted. We treat these accounts a lot like they're containers holding individual objects called dollars. Certainly our language is set up like that. But we know that it's really just the totals that have any kind of representation. The same goes for quanta. It's just a manner of speaking about systems that can have different amounts of energy and where the spectrum of energy levels forms a ladder with equally spaced rungs. Because of your experience with money I probably don't need to give you any more examples.

One more bit of terminology: when the spectrum of energies is discrete it's said to be quantised.

Coupled systems

Let's return to classical physics with a slightly more complex system consisting of two masses connected to springs. We ignore gravity now:

We restrict ourselves to just considering back and forth motion constrained along a horizontal line. This is a coupled system. If the left mass moves to the right, not only does it experience a restoring force pushing it left, but the mass on the right will experience more of a force pushing it to the left. We can't treat the masses as independent and so we don't get the simple solution of each mass always oscillating with a sine wave.

For this particular problem though there's a trick to turn it into a pair of harmonic oscillators. The idea is to consider the pair of masses as a single entity. We can think of the motion centre of mass of the pair, the midpoint between them, as being one variable that describes this entity. Let's call its motion the external motion. We can also think of the distance between the two masses in the pair as being the system's internal motion. (I'm just using internal and external as convenient names. Don't read too much into them.) It turns out that when you analyse this using classical dynamics the internal motion and the external motion act like independent quantities. What's more, each one behaves exactly like it's simple harmonic. So we get one sine wave describing the overall motion of the pair, and another one that describes how the elements of the pair oscillate with respect to each other.

The frequencies of the internal and external motions are typically different. So you can end up with some quite complicated motions with two different frequencies beating against each other.

When we're able to find ways to split up the motion into independent quantities, each of which is simple harmonic, each kind of motion is said to be a normal mode.

When you have independent normal modes, you can treat them independently in quantum mechanics too. So what we get is that the spectrum of possible energy levels for this system is, in some sense, two-dimensional. We can put quanta into the internal oscillation and we can also put quanta into the external oscillation. Because these modes have different frequencies the quanta for each mode correspond to different amounts of energy.

(And a reminder: when talking about quantum mechanics I'm not talking literally about masses on springs. I'm talking about physical systems that have equations of motion that mean they behave like masses on springs. In this case it might be a pair of particles trapped in a microscopic well with a repulsive force between them.)

Solid state physics

Now I'm going to jump from just two masses to a large number of them. For example, the behavior of trillions of atoms in a solid crystal can be approximately modelled by a grid of masses and springs, of which the following diagram is just a tiny piece:

A real crystal would be arranged in a 3D lattice but I've drawn 2D here for convenience.

Think of the springs as both pushing apart atoms that get close, and pulling together atoms that move apart.

This is a highly coupled system. Ultimately every atom in our lattice is connected to every other one, either directly, or indirectly. Nonetheless, it is still possible to find normal modes. The normal modes all have the same basic form: they are all sinusoidal waves of displacement traveling in some direction with some speed and oscillation frequency. Each of these modes consists of waves that extend through the entire crystal, with fixed spacing between parallel planar wavefronts. This type of waves is known as a plane wave. If the system is perfectly harmonic, so the restoring force is precisely proportional to the displacement, then each direction and frequency of wave oscillates its way through the crystal completely independently of any other. Just as how in the example with two masses any possible oscillation is a combination of internal and external motion, for a crystal lattice any motion is a combination of these plane waves. (Decomposing any oscillation as a combination of plane waves is known as computing its Fourier transform.

Now we're ready to consider this situation quantum mechanically. Because each plane wave is a normal mode, we can treat each one as an independent simple harmonic oscillator. This means that the energy in each plane wave is quantised. So when we consider a crystal lattice quantum mechanically we find that its states consist of plane waves propagating through it, but where the amount of energy in each wave is given by a discrete spectrum. So again we can talk about how many quanta there are in each mode.

Linguistically it gets a bit more interesting now. Each plane wave is associated with a particular direction and speed so it makes sense to talk of these quanta as having a direction and speed. But note that statements involving quanta are still really just sentences about energy levels. So, for example, the statement "the mode of this system with this velocity and frequency is in its first excited state" is, by definition, exactly the same as "this system has precisely one quantum with this velocity and frequency". In particular, when we write sentences like these we aren't implying that there is some new kind of object, the quantum, that has suddenly attached itself to our crystal. The quanta are properties of the lattice. By the way, in the particular case of vibrating atoms in a lattice, the quanta are known by a special name: phonons.

Quantum field theory and photons

And now we're ready to move onto photons.

In classical physics, electromagnetism is described by Maxwell's equations. Maxwell's equations say that a varying magnetic field generates an electric field and a varying electric field generates a magnetic field. The result is that it is possible for an oscillating electric field to create an oscillating electric field so that an electric field can propagate through space on its own without the help of electric charges or electric currents or any other kind of `generator'. As these electric fields also produce magnetic fields that propagate with them, the whole thing is called an electromagnetic wave.

Just like displacements in a crystal lattice, an electromagnetic wave also has normal modes. The normal modes are plane waves traveling at the speed of light in a particular directions with a given frequency. You have personal experience of this. Visible light is electromagnetic radiation with a frequency of around 500 THz. Wifi uses signals at around 5 GHz. The radio might use signals at around 100 MHz. When you surf the web wirelessly while listening to the radio, the wifi signals don't interfere with your vision or the radio signal. (Actually, wifi might interfere with the radio signals, but not because of the 5 GHz signals. It might happen if badly manufactured hardware emits stray signals around the 100 MHz band.) That's because these waves pass through each other without being coupled to each other in any way. And at this point you might already be guessing what a photon is. For each choice of frequency and direction (and also polarisation, but that's just a detail) the amount of energy that can be in the corresponding mode is quantised. For the electromagnetic field the quanta are called photons.

And that's it!

Electromagnetic waves can be thought of as being made up of different oscillation modes. Because of quantum mechanics, each mode contains an amount of energy that is quantised to be a whole number multiple of some base amount. Although the thing that really matters is the total amount of energy in the modes, it can still be useful to talk about this total as if it's a collection of entities called photons.

One thing to notice is that the normal modes for an electromagnetic wave are plane waves that are extended in space. In principle all the way across the universe but for practical problems physicists often consider electromagnetic waves in a large but finite box. This means that adding a quantum to a system has an effect that extends across the entire system. That makes it problematic to talk about the location of a photon.

Caveat

Physicists sometimes use the word photon in slightly different but related ways. I've described what I think of as the core definition as presented in many courses on quantum field theory.

Acknowledgements

Thanks to @dmoore2718 for encouraging me to edit this document down to a better size.

tag:blogger.com,1999:blog-11295132.post-8305461568159682840

Extensions

Self-referential logic via self-referential circuits

sigfpe Jul 15, 2017 Updated Apr 22, 2026

Show full content

Introduction

TL;DR The behaviour of a certain kind of delay component has a formal similarity to Löb's theorem which gives a way to embed part of provability logic into electronic circuits.

Here's a famous paradoxical sentence:

This sentence is false

If it's false then it's true and if it's true then it's false.

Here's a paradoxical electronic circuit:
The component in the middle is an inverter. If the output of the circuit is high then its input is high and then it's output must be low, and vice versa.

There's a similarity here. But with a bit of tweaking you can turn the similarity into an isomorphism of sorts.

In the first case we avoid paradox by noting that in the mathematical frameworks commonly used by mathematicians it's impossible, in general, for a statement to assert it's own falsity. Instead, a statement can assert its own unprovability and then we get Gödel's incompleteness theorems and a statement that is apparently true and yet can't be proved.

In the second case we can't model the circuit straightforwardly as a digital circuit. In practice it might settle down to a voltage that lies between the official high and low voltages so we have to model it as an analogue circuit. Or instead we can introduce a clock and arrange that the feedback in the circuit is delayed. We then get an oscillator circuit that can be thought of as outputting a stream of bits.

The observation I want to make is that if the feedback delay is defined appropriately, these two scenarios are in some sense isomorphic. This means that we can model classic results about provability, like Gödel's incompleteness theorems, using electronic circuits. We can even use such circuits to investigate what happens when logicians or robots play games like Prisoner's Dilemma. I'll be making use of results found in Boolos' book on The Logic of Provability and some ideas I borrowed from Smoryński's paper on Fixed Point Algebras. I'll be assuming the reader has at least a slight acquaintance with ithe ideas behind provability logic.

Provability Logic

There are many descriptions of provability logic (aka GL) available online, so I'm not going to repeat it all here. However, I've put some background material in the appendix below and I'm going to give a very brief reminder now.

Start with (classical) propositional calculus which has a bunch of variables with names like $a, b, c, d, \ldots$ and connectives like $\wedge$ for AND, $\vee$ for OR, $\neg$ for NOT and $\rightarrow$ for implication. (Note that $a\rightarrow b = \neg a\vee b$.)

Provability logic extends propositional calculus by adding a unary operator $\Box$. (I apologise, that's meant to be a □ but it's coming out like $\Box$ in LaTeX formulae. I think it's a bug in Google's LaTeX renderer.) The idea is that $\Box p$ asserts that $p$ is provable in Peano Arithmetic, aka PA. In addition to the axioms of propositional calculus we have

$\Box(p\rightarrow q)\rightarrow\Box p\rightarrow\Box q$

and

$\Box p\rightarrow\Box\Box p$

as well as a rule that allows us to deduce $\Box p$ from $p$.

We also have this fixed point property:

Let $F(p)$ be any predicate we can write in the language of GL involving the variable $p$, and suppose that every appearance of $p$ in $F(p)$ is inside a $\Box$, e.g. $F(p)=\Box p\vee\Box(\neg p)$. Then there is a fixed point, i.e. a proposition $q$ that makes no mention of $p$ such that $q\leftrightarrow F(q)$ is a theorem. In effect, for any such $F$, $q$ is a proposition that asserts $F(q)$.

See the appendix for a brief mention of why we should expect this to be true.

From the fixed point property we can deduce Löb's theorem: $\Box(\Box p\rightarrow p)\rightarrow\Box p$. There is a proof at wikipedia that starts from the fixed point property.

We can also deduce the fixed point property from Löb's theorem so it's more usual to take Löb's theorem as an axiom of GL and show that the fixed point property follows. You can think of Löb's theorem as a cunning way to encode the fixed point property. In fact you can argue that it's a sort of Y-combinator, the function that allows the formation of recursive fixed points in functional programming languages. (That's also, sort of, the role played by the loeb function I defined way back. But note that loeb isn't really a proof of Löb's theorem, it just has formal similarities.)

Back to electronic circuits

In order to make digital circuits with feedback loops well-behaved I could introduce a circuit element that results in a delay of one clock cycle. If you insert one of these into the inverter circuit I started with you'll end up with an oscillator that flips back and forth between 0 and 1 on each clock cycle. But I want to work with something slightly stricter. I'd like my circuits to eventually stop oscillating. (I have an ulterior motive for studying these.) Let me introduce this component:
It is intended to serve as a delayed latch and I'll always have the flow of data being from left to right. The idea is that when it is switched on it outputs 1. It keeps outputting 1 until it sees a 0 input. When that happens, then on the next clock cycle its output drops to 0 and never goes back up to 1 until reset.

Because the output of our delay-latch isn't a function of its current input, we can't simply describe its operation as a mathematical function from $\{0,1\}$ to $\{0,1\}$. Instead let's think of electronic components as binary operators on bitstreams, i.e. infinite streams of binary digits like ...00111010 with the digits emerging over time starting with the one written on the right and working leftwards. The ordinary logic gates perform bitwise operations which I'll represent using the operators in the C programming language. For example,

...001110 & ...101010 = ...001010

and

~...101 = ...010

and so on. Let's use □ to represent the effect of latch-delay on a bitstream. We have, for example,

□...000 = ...001

and

□...11101111 = ...00011111.

The operator □ takes the (possibly empty) contiguous sequence of 1's at the end of the bitstream, extends it by one 1, and sets everything further to the left to 0. If we restrict ourselves to bitstreams that eventually become all 0's or all 1's on the left, then bitstreams are in one-to-one correspondence with the integers using the twos complement representation. For example ...111111, all 1's, represents the number -1. I'll simply call the bistreams that represent integers integers. With this restriction we can use a classic C hacker trick to write □p=p^(p+1) where ^ is the C XOR operator. The operator □ outputs the bits that get flipped when you add one.

Let's use the symbol → so that a → b is shorthand for ~a|b. Here are some properties of □:

1. □(-1) = -1

2. □p → □□p = -1

3. □(p → q) → □p → □q = -1

In addition we have the fixed point property:

Let F(p) be any function of p we can write using □ and the bitwise logical operators and such that all occurrences of p occur inside □. Then there is a unique bitstream q such that q=F(q).

We can make this clearer if we return to circuits. F(p) can be thought of as a circuit that takes p as input and outputs some value. We build the circuit using only boolean logic gates and delay-latch. We allow feedback loops, but only ones that go through delay-latches. With these restrictions it's pretty clear that the circuit is well-behaved and deterministically outputs a bitstream.

We also have the Löb property:

4. □(□p → p) → □p = -1

We can see this by examining the definition of □. Intuitively it says something like "once □ has seen a 0 input then no amount of setting input bits to 1 later in the stream make any different to its output".

I hope you've noticed something curious. These properties are extremely close to the properties of $\Box$ in GL. In fact, these electronic circuits form a model of the part of GL that doesn't involve variable names, i.e. what's known as letterless GL. We can formalise this:

1. Map $\bot$ to a wire set to 0, which outputs ...000 = 0.

2. Map $\top$ to a wire set to 1, which outputs ...111 = -1.

3. Map $p \circ q$, where $\circ$ is a binary connective, by creating a circuit that takes the outputs from the circuits for $p$ and $q$ and passes them into the corresponding boolean logic gate.

4. Map $\Box p$ to the circuit for $p$ piped through a delay-latch.

For example, let's convert $\Box(\Box\bot\rightarrow\bot)\rightarrow\Box\bot$ into a circuit. I'm translating $a\rightarrow b$ to the circuit for $\neg a\vee b$.

I'm using red wires to mean wires carrying the value 1 rather than 0. I hope you can see that this circuit eventually settles into a state that outputs nothing but 1s.

We have this neat result:

Because delay-latch satisfies the same equations as $\Box$ in provability logic, any theorem, translated into a circuit, will produce a bistream of just 1s, i.e. -1.

But here's a more surprising result: the converse is true.

If the circuit corresponding to a letterless GL proposition produces a bistream of just 1s then the proposition is actually a theorem of GL.

I'm not going to prove this. (It's actually a disguised form of lemma 7.4 on p.95 of Boolos' book.) In the pictured example we got ...1111, so the circuit represents a theorem. As it represents Löb's theorem for the special case $p=\bot$ we should hope so. More generally, any bitstream that represents an integer can be converted back into a proposition that is equivalent to the original proposition. This means that bitstreams faithfully represent propositions of letterless GL. I'm not going to give the translation here but it's effectively given in Chapter 7 of Boolos. I'll use $\psi(p)$ to represent the translation from propositions to bitstreams via circuits that I described above. Use $\phi(b)$ to represent the translation of bitstream $b$ back into propositions. We have $p\leftrightarrow\phi(\psi(p))$. But I haven't given a full description of $\phi$ and I haven't proved here that it has this property.

Circuits with feedback

In the previous section I considered letterless propositions of GL. When these are translated into circuits they don't have feedback loops. But we can also "solve equations" in GL using circuits with feedback. The GL fixed point theorem above says that we can "solve" the equation $p\leftrightarrow F(p)$, with one letter $p$, to produce a letterless proposition $q$ such that $q\leftrightarrow F(q)$. Note here that $p$ is a letter in the language of GL. But I'm using $q$ to represent a proposition in letterless GL. If we build a circuit to represent $F$, and feed its output back into where $p$ appears, then the output bitstream represents the fixed point. Here's a translation of the equation $p \leftrightarrow \neg(\Box p \vee \Box\Box\Box p)$: I'll let you try to convince yourself that such circuits always eventually output all 0's or all 1's. When we run the circuit we get the output ...1111000 = -8. As this is not -1 we know that the fixed point isn't a theorem. If I'd defined $\phi$ above you could use it to turn the bitstream back into a proposition.

The same, syntactically (optional section)

I have a Haskell library on github for working with GL: provability. This uses a syntactic approach and checks propositions for theoremhood using a tableau method. We can use it to analyse the above example with feedback. I have implemented a function, currently called value', to perform the evaluation of the bitstream for a proposition. However, in this case the fixedpoint function computes the fixed point proposition first and then converts to a bitstream rather than computing the bitstream directly from the circuit for F:

> let f p = Neg (Box p \/ Box (Box (Box p)))
> let Just p = fixedpoint f
> p
Dia T /\ Dia (Dia T /\ Dia (Dia T /\ Dia T))
> value' p
-8

(Note that Dia p means $\Diamond p = \neg\Box\neg p$.)

The function fixedpoint does a lot of work under the hood. (It uses a tableau method to carry out Craig interpolation.) The circuit approach requires far less work.

Applications

1. Programs that reason about themselves

In principle we can write a program that enumerates all theorems of PA. That means we can use a quine trick to write a computer program that searches for a proof, in PA, of its own termination. Does such a program terminate?

We can answer this with Löb's theorem. Let $p =$ "The program terminates". The program terminates if it can prove its termination. Formally this means we assume $\Box p\rightarrow p$. Using one of the derivation rules of GL we get $\Box(\Box p\rightarrow p)$. Löb's theorem now gives us $\Box p$. Feed that back into our original hypothesis and we get $p$. In other words, we deduce that our program does in fact terminate. (Thanks to Sridhar Ramesh for pointing this out to me.)

But we can deduce this using a circuit. We want a solution to $p\leftrightarrow \Box p$. Here's the corresponding circuit: It starts by outputting 1's and doesn't stop. In other words, the fixed point is a theorem. And that tells us $p$ is a theorem. And hence that the program terminates.

2. Robots who reason about each others play in Prisoner's Dilemma

For the background to this problem see Robust Cooperation in the Prisoner's Dilemma at LessWrong. We have two robot participants $A$ and $B$ playing Prisoner's Dilemma. Each can examine the other's source code and can search for proofs that the opponent will cooperate. Suppose each robot is programmed to enumerate all proofs of PA and cooperate if it finds a proof that its opponent will cooperate. Here we have $p =$ "A will cooperate" and $q =$ "B will cooperate". Our assumptions about the behaviour of the robots are $p \leftrightarrow \Box q$ and $q \leftrightarrow \Box p$, and hence that $p \leftrightarrow \Box\Box p$. This corresponds to the circuit: This outputs ...1111 = -1 so we can conclude $p$ and hence that these programs will cooperate. (Note that this doesn't work out nicely if robot B has a program that doesn't terminate but whose termination isn't provable in the formal system A is using. That means this approach is only good for robots that want to cooperate and want to confirm such cooperation. See the paper for more on this.)

At this point I really must emphasise that these applications are deceptively simple. I've shown how these simple circuits can answer some tricky problems about provability. But these aren't simply the usual translations from boolean algebra to logic gates. They work because circuits with delay-latch provide a model for letterless provability logic and that's only the case because of a lot of non-trivial theorem proving in Boolos that I haven't reproduced here. You're only allowed to use these simple circuits once you've seen the real proofs :-)

Things I didn't say above

1. I described the translation from propositions to circuits that I called $\psi$ above. But I didn't tell you what $\phi$ looks like. I'll leave this as an exercise. (Hint: consider the output from the translation of $\Box^n\bot$ into a circuit.)

2. The integers, considered as bistreams, with the bitwise operators, and the unary operator □p=p^(p+1), form an algebraic structure. For example, if we define ⋄p=~□~p we have a Magari algebra. Structures like these are intended to capture the essential parts of self-referential arguments in an algebraic way.

3. Because of the interpretation of □ as a delayed latch in a circuit you could view it as saying "my input was always true until a moment ago". This surely embeds provability logic in a temporal logic of some sort.

4. (Deleted speculations about tit-for-tat that need rethinking.)

5. For even the most complex letterless proposition in Boolos you could check its theoremhood with a pretty small circuit. You could even consider doing this with a steam powered pneumatic circuit. I had to say that to fulfil a prophecy and maintain the integrity of the timeline.

Appendix on provability

The modern notion of a proof is that it is a string of symbols generated from some initial strings called "axioms" and some derivation rules that make new strings from both axioms and strings you've derived previously. Usually we pick axioms that represent "self-evident" truths and we pick derivation rules that are "truth-preserving" so that every proof ends at a true proposition of which it is a proof. The derivation rules are mechanical in nature: things like "if you have this symbol here and that symbol there then you can replace this symbol with that string you derived earlier" etc.

You can represent strings of symbols using numbers, so-called Gödel numbers. Let's pick a minimal mathematical framework for working with numbers: Peano Arithmetic, aka PA. Let's assume we've made some choice of Gödel numbering scheme and when $p$ is a proposition, write $[p]$ for the number representing $p$. You can represent the mechanical derivation rules as operations on numbers. And that makes it possible to define a mathematical predicate $Prov$ that is true if and only if its argument represents a provable proposition.

In other words, we can prove $Prov([p])$ using PA if and only if $p$ is a proposition provable in PA.

The predicate $Prov$ has some useful properties:

1.If we can prove $p$, then we can prove $Prov([p])$.

We take the steps we used to prove $p$, and convert everything to propositions about numbers. If $Prov$ is defined correctly then we can convert that sequence of numbers into a sequence of propositions about those numbers that makes up a proof of $Prov(p)$.

2.$Prov([p\rightarrow q])$ and $Prov([p])$ imply $Prov([q])$

A fundamental step in any proof is modus ponens, i.e. that $p\rightarrow q$ and $q$ implies $p$. If $Prov$ does its job correctly then it had better know about this.

3.$Prov([p])$ implies $Prov([Prov([p])])$

One way is to prove this is to use Löb's theorem.

4. $Prov([\top])$

The trivially true statement had better be provable or $Prov$ is broken.

Constructing $Prov$ is conceptually straightforward but hard work. I'm definitely not going to do it here.

And there's one last thing we need: self-reference. If $p$ is a proposition, how can we possibly assert $Prov([p])$ without squeezing a copy of $[p]$ inside $p$? I'm not going to do that here either - just mention that we can use a variation of quining to achieve this. That allows us to form a proposition $p$ for which we can prove $p\leftrightarrow Prov([p])$. In fact, we can go further. We can find propositions that solve $p\leftrightarrow F(p)$ for any predicate $F(p)$ built from the usual boolean operations and $p$ as long as all of the occurrences of $p$ are inside the appearances of $Prov$. Even though we can't form a proposition that directly asserts its own falsity, we can form one that asserts that it is unprovable, or one that asserts that you can't prove that you can't prove that you can prove it, or anything along those lines.

Anyway, all that $[]$ and $Prov$ business is a lot of hassle. Provability logic, also known as GL, is intended to capture specifically the parts of PA that relate to provability. GL is propositional calculus extended with the provability operator $\Box$. The intention is that if $p$ is a proposition, $\Box p$ is a proposition in GL that represents $Prov([p])$ in PA. The properties of $Prov$ above become the axioms and derivation rules of GL in the main text.

tag:blogger.com,1999:blog-11295132.post-1013073177509119669

Extensions

A relaxation technique

sigfpe Jun 7, 2017 Updated Jun 8, 2017

Show full content

Introduction

Sometimes you want to differentiate the expected value of something. I've written about some tools that can help with this. For example you can use Automatic Differentiation for the derivative part and probability monads for the expectation. But the probability monad I described in that article computes the complete probability distribution for your problem. Frequently this is intractably large. Instead people often use Monte Carlo methods. They'll compute the "something" many times, substituting pseudo-random numbers for the random variables, and then average the results. This provides an estimate of the expected value and is ubiquitous in many branches of computer science. For example it's the basis of ray-tracing and path-tracing algorithms in 3D rendering, and plays a major role in machine learning when used in the form of stochastic gradient descent.

But there's a catch. Suppose we want to compute $E(f(X_1,X_2,\ldots,X_n))$ where each of the $X_i$ belong to the Bernoulli distribution $B(p)$ . I.e. each $X_n$ has a probability $p$ of being 1 and probability $1-p$ of being 0. If we compute this using a Monte Carlo approach we'll repeatedly generate pseudo-random numbers for each of the $X_i$ . Each one will be 0 or 1. This means that our estimate depends on $p$ via subexpressions that can't meaningfully be differentiated with respect to $p$ . So how can we use automatic differentiation with the Monte Carlo method? I'm proposing an approach that may or may not already be in the literature. Whether it is or not, I think it's fun to get there by combining many of the things I've previously talked about here, such as free monads, negative probabilities and automatic differentiation. I'm going to assume you're familiar with using dual numbers to compute derivatives as I've written about this before and wikipedia has the basics.

A probability monad

I want to play with a number of different approaches to using monads with probability theory. Rather than define lots of monads I think that the easiest thing is to simply work with one free monad and then provide different interpreters for it.

First some imports:

> import Control.Monad
> import qualified System.Random as R
> import qualified Data.Map.Strict as M

I'm going to use a minimal free monad that effectively gives us a DSL with a new function that allows us to talk about random Bernoulli variables:

> data Random p a = Pure a | Bernoulli p (Int -> Random p a)

The idea is that Pure a represents the value a and Bernoulli p f is used to say "if we had a random value x, f x is the value we're interested in". The Random type isn't going to do anything other than represent these kinds of expressions. There's no implication that we actually have a random value for x yet.

> instance Functor (Random p) where
>     fmap f (Pure a) = Pure (f a)
>     fmap f (Bernoulli p g) = Bernoulli p (fmap f . g)


> instance Applicative (Random p) where
>     pure = return
>     (<*>) = ap


> instance Monad (Random p) where
>     return = Pure
>     Pure a >>= f = f a
>     Bernoulli p g >>= f = Bernoulli p (\x -> g x >>= f)

We'll use bernoulli p to represent a random Bernoulli variable drawn from $B(p)$ .

> bernoulli :: p -> Random p Int
> bernoulli p = Bernoulli p return

So let's write our first random expression:

> test1 :: Random Float Float
> test1 = do
>     xs <- replicateM 4 (bernoulli 0.75)
>     return $ fromIntegral $ sum xs

It sums 4 Bernoulli random variables from $B(\frac{3}{4})$ and converts the result to a Float. The expected value is 3.

We don't yet have a way to do anything with this expression. So let's write an interpreter that can substitute pseudo-random values for each occurrence of bernoulli p:

It's essentially interpreting our free monad as a state monad where the state is the random number seed:

> interpret1 :: (Ord p, R.Random p, R.RandomGen g) => Random p a -> g -> (a, g)
> interpret1 (Pure a) seed = (a, seed)
> interpret1 (Bernoulli prob f) seed = 
>     let (r, seed') = R.random seed
>         b       = if r <= prob then 1 else 0
>     in interpret1 (f b) seed'

You can use the expression R.getStdRandom (interpret1 test1) if you want to generate some random samples for yourself.

We're interested in the expected value, so here's a function to compute that:

> expect1 :: (Fractional p, Ord p, R.Random p, R.RandomGen g) => Random p p -> Int -> g -> (p, g)
> expect1 r n g = 
>     let (x, g') = sum1 0 r n g
>     in (x/fromIntegral n, g')


> sum1 :: (Ord p, Num p, R.Random p, R.RandomGen g) => p -> Random p p -> Int -> g -> (p, g)
> sum1 t r 0 g = (t, g)
> sum1 t r n g =
>     let (a, g') = interpret1 r g
>     in sum1 (t+a) r (n-1) g'

You can test it out with R.getStdRandom (expect1 test1 1000). You should get values around 3.

We can try completely different semantics for Random. This time we compute the entire probability distribution:

> interpret2 :: (Num p) => Random p a -> [(a, p)]
> interpret2 (Pure a) = [(a, 1)]
> interpret2 (Bernoulli p f) =
>     scale p (interpret2 (f 1)) ++ scale (1-p) (interpret2 (f 0))


> scale :: Num p => p -> [(a, p)] -> [(a, p)]
> scale s = map (\(a, p) -> (a, s*p))

You can try it with interpret2 test1.

Unfortunately, as it stands it doesn't collect together multiple occurrences of the same value. We can do that with this function:

> collect :: (Ord a, Num b) => [(a, b)] -> [(a, b)]
> collect = M.toList . M.fromListWith (+)

And now you can use collect (interpret2 test1).

Let's compute some expected values:

> expect2 :: (Num p) => Random p p -> p
> expect2 r = sum $ map (uncurry (*)) (interpret2 r)

The value of expect2 test1 should be exactly 3. One nice thing about interpret2 is that it is differentiable with respect to the Bernoulli parameter when this is meaningful. Unfortunately it has one very big catch: the value of interpret2 can be a very long list. Even a small simulation can results in lists too big to store in the known universe. But interpret1 doesn't produce differentiable results. Is there something in-between these two interpreters?

Importance sampling

Frequently in Monte Carlo sampling it isn't convenient to sample from the distribution you want. For example it might be intractably hard to do so, or you might have proven that the resulting estimate has a high variance. So instead you can sample from a different, but possibly related distribution. This is known as importance sampling. Whenever you do this you must keep track of how "wrong" your probability was and patch up your expectation estimate at the end. For example, suppose a coin comes up heads 3/4 of the time. Instead of simulating a coin toss that comes up 3/4 of the time you could simulate one that comes up heads half of the time. Suppose at one point in the simulation it does come up heads. Then you used a probability of 1/2 when you should have used 3/4. So when you compute the expectation you need to scale the contribution from this sample by (3/4)/(1/2) = 3/2. You need so scale appropriately for every random variable used. A straightforward way to see this for the case of a single Bernoulli variable is to note that

$E(f(X)) = pf(1)+(1-p)f(0) = q\frac{p}{q}f(1)+(1-q)\frac{1-p}{1-q}f(0)$ .

We've replaced probabilities $p$ and $1-p$ with $q$ and $1-q$ but we had to scale $f$ appropriately in each of the cases $f(0)$ and $f(1)$ to keep the final value the same. I'm going to call the scale value the importance. If we generate $n$ random numbers in a row we need to multiply all of the $n$ importance values that we generate. This is a perfect job for the Writer monad using the Product monoid. (See Eric Kidd's paper for some discussion about the connection between Writer and importance sampling.) However I'm just going to write an explicit interpreter for our free monad to make it clear what's going where.

This interpreter is going to take an additional argument as input. It'll be a rule saying what probability we should sample with when handling a variable drawn from $B(p)$ . The probability should be a real number in the interval $[0,1]$ .

> interpret3 :: (Fractional p, R.RandomGen g) =>
>               (p -> Float) -> Random p a -> g -> ((a, p), g)
> interpret3 rule (Pure a) g = ((a, 1), g)
> interpret3 rule (Bernoulli p f) g = 
>     let (r, g') = R.random g
>         prob = rule p
>         (b, i)  = if (r :: Float) <= prob
>           then (1, p/realToFrac prob)
>           else (0, (1-p)/realToFrac (1-prob))
>         ((a, i'), g'') = interpret3 rule (f b) g'
>     in ((a, i*i'), g'')

Here's the accompanying code for the expectation:

> expect3 :: (Fractional p, R.RandomGen g) =>
>            (p -> Float) -> Random p p -> Int -> g -> (p, g)
> expect3 rule r n g = 
>     let (x, g') = sum3 rule 0 r n g
>     in (x/fromIntegral n, g')


> sum3 :: (Fractional p, R.RandomGen g) =>
>         (p -> Float) -> p -> Random p p -> Int -> g -> (p, g)
> sum3 rule t r 0 g = (t, g)
> sum3 rule t r n g =
>     let ((a, imp), g') = interpret3 rule r g
>     in sum3 rule (t+a*imp) r (n-1) g'

For example, you can estimate the expectation of test1 using unbiased coin tosses by evaluating R.getStdRandom (expect3 (const 0.5) test1 1000).

Generalising probability

Did you notice I made my code slightly more general than seems to be needed? Although I use probabilities of type Float to generate my Bernoulli samples, the argument to the function bernoulli can be of a more general type. This means that we can use importance sampling to compute expected values for generalised measures that take values in a more general algebraic structure than the interval [0,1]. For example, we could use negative probabilities. An Operational Interpretation of Negative Probabilities and No-Signalling Models by Adamsky and Brandenberger give a way to interpret expressions involving negative probabilities. We can implement it using interpret3 and the rule \p -> abs p/(abs p+abs (1-p)). Note that it is guaranteed to produce values in the range [0,1] (if you start with dual numbers with real parts that are ordinary probabilities) and reproduces the usual behaviour when given ordinary probabilities.

Here's a simple expression using a sample from " $B(2)$ ":

> test2 = do
>     a <- bernoulli 2
>     return $ if a==1 then 2.0 else 1.0

It's expected value is 3. We can get this exactly using expect2 test2. For a Monte Carlo estimate use

R.getStdRandom (expect3 (\back p -> abs p/(abs p+abs (1-p))) test2 1000)

Note that estimates involving negative probabilities can have quite high variances so try a few times until you get something close to 3 :-)

We don't have to stick with real numbers. We can use this approach to estimate with complex probabilities (aka quantum mechanics) or other algebraic structures.

Discrete yet differentiable

And now comes the trick: automatic differentiation uses the algebra of dual numbers. It's not obvious at all what a probability like $0.5+\epsilon$ means when $\epsilon$ is infinitesimal. However, we can use interpret3 to give it meaningful semantics.

Let'd define the duals in the usual way first:

> data Dual a = D { real :: a, infinitesimal :: a }


> instance (Ord a, Num a) => Num (Dual a) where
>     D a b + D a' b' = D (a+a') (b+b')
>     D a b * D a' b' = D (a*a') (a*b'+a'*b)
>     negate (D a b) = D (negate a) (negate b)
>     abs (D a b) = if a > 0 then D a b else D (-a) (-b)
>     signum (D a b) = D (signum a) 0
>     fromInteger a = D (fromInteger a) 0


> instance (Ord a, Fractional a) => Fractional (Dual a) where
>     fromRational a = D (fromRational a) 0
>     recip (D a b) = let ia = 1/a in D ia (-b*ia*ia)


> instance Show a => Show (Dual a) where
>     show (D a b) = show a ++ "[" ++ show b ++ "]"

Now we can use the rule real to give as a real-valued probability from a dual number. The function expect3 will push the infinitesimal part into the importance value so it doesn't get forgotten about. And now expect3 gives us an estimate that is differentiable despite the fact that our random variables are discrete.

Let's try an expression:

> test3 p = do
>     a <- bernoulli p
>     b <- bernoulli p
>     return $ if a == 1 && b == 1 then 1.0 else 0.0

The expected value is $p^2$ and the derivative is $2p$ . We can evaluate at $p=0.5$ with expect2 (test3 (D 0.5 1)). And we can estimate it with

R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)

What's neat is that we can parameterise our distributions in a more complex way and we can freely mix with conventional expressions in our parameter. Here's an example:

> test4 p = do
>     a <- bernoulli p
>     b <- bernoulli (p*p)
>     return $ p*fromIntegral a*fromIntegral b

Try evaluating expect2 (test4 (D 0.5 1)) and

R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)

I've collected the above examples together here:

> main = do
>     print =<< R.getStdRandom (interpret1 test1)
>     print $ collect $ interpret2 test1
>     print =<< R.getStdRandom (expect1 test1 1000)
>     print (expect2 test1)
>     print =<< R.getStdRandom (expect3 id test1 1000)
>     print =<< R.getStdRandom (expect3 (const 0.5) test1 1000)
>     print "---"
>     print $ expect2 test2
>     print =<< R.getStdRandom (expect3 (\p -> abs p/(abs p+abs (1-p))) test2 1000)
>     print "---"
>     print $ expect2 (test3 (D 0.5 1))
>     print =<< R.getStdRandom (expect3 real (test3 (D 0.5 1)) 1000)
>     print "---"
>     print $ expect2 (test4 (D 0.5 1))
>     print =<< R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)

What just happened?

You can think of a dual number as a real number that has been infinitesimally slightly deformed. To differentiate something we need to deform something. But we can't deform 0 or 1 and have them stay 0 or 1. So the trick is to embed probability sampling in something "bigger", namely importance sampling, where samples carry around an importance value. This bigger thing does allow infinitesimal deformations. And that allows differentiation. This process of turning something discrete into something continuously "deformable" is generally called relaxation.

Implementation details

I've made no attempt to make my code fast. However I don't think there's anything about this approach that's incompatible with performance. There's no need to use a monad. Instead you can track the importance value through your code by hand and implement everything in C. Additionally, I've previously written about the fact that for any trick involving forward mode AD there is another corresponding trick you can use with reverse mode AD. So this method is perfectly comptible with back-propagation. Note also that the dual number importances always have real part 1 which means you don't actually need to store them.

The bad news is that the derivative estimate can sometimes have a high variance. Nonetheless, I've used it successfully for some toy optimisation problems. I don't know if this approach is effective for industrial strength problems. Your mileage may vary :-)

Alternatives

Sometimes you may find that it is acceptable to deform the samples from your discrete distribution. In that case you can use the concrete relaxation.

Continuous variables

The above method can be adapted to work with continuous variables. There is a non-trivial step which I'll leave as an exercise but I've tested it in some Python code. I think it reproduces a standard technique and it gives an alternative way to think about that trick. That article is also useful for ways to deal with the variance issues. Note also that importance sampling is normally used itself as a variance reduction technique. So there are probably helpful ways to modify the rule argument to interpret3 to simultaneously estimate derivatives and keep the variance low.

Personal note

I've thought about this problem a couple of times over the years. Each time I've ended up thinking "there's no easy way to extend AD to work with random variables so don't waste any more time thinking about it". So don't listen to anything I say. Also, I like that this method sort of comes "for free" once you combine methods I've described previously.

Acknowledgements

I think it was Eric Kidd's paper on building probability monads that first brought to my attention that there are many kinds of semantics you can use with probability theory - i.e. there are many interpreters you can write for the Random monad. I think there is an interesting design space worth exploring here.

Answer to exercise

I set the continuous case as an exercise above. Here is a solution.

Suppose you're sampling from a distribution parameterised by $\theta$ with pdf $p(\theta, \cdot)$ . To compute the derivative with respect to $\theta$ you need to consider sampling from $p(\theta+\epsilon, \cdot)$ where $\epsilon$ is an infinitesimal.

$p(\theta+\epsilon, x) = p(\theta, x)+\epsilon\frac{\partial p(\theta, x)}{\partial\theta}$ .

As we don't know how to sample from a pdf with infinitesimals in it, we instead sample using $p$ as usual, but use an importance of

$(p(\theta, x)+\epsilon\frac{\partial p(\theta, x)}{\partial\theta})/p(\theta, x) = 1+\epsilon\frac{\partial}{\partial\theta}\log p(\theta, x)$

The coefficient of the $\epsilon$ gives the derivative. So we need to compute the expectation, scaling each sample with this coefficient. In other words, to estimate $E(f(X))$ we use

$\frac{1}{N}\sum_{i=1}^N f(x_i)\frac{\partial}{\partial\theta}\log p(\theta, x_i)$

where the $x_i$ are drawn from the original distribution. This is exactly what is described at Shakir Mohamed's blog.

Final word

I managed to find the method in the literature. It's part of the REINFORCE method. For example, see equation (5) there.

tag:blogger.com,1999:blog-11295132.post-431820939580351638

Extensions

Logarithms and exponentials of functions

sigfpe Feb 5, 2017 Updated Apr 22, 2026

Show full content

Introduction

A popular question in mathematics is this: given a function $f$, what is its "square root" $g$ in the sense that $g(g(x)) = f(x)$. There are many questions about this on mathoverflow but it's also a popular subject in mathematics forums for non-experts. This question seems to have a certain amount of notoriety because it's easy to ask but hard to answer fully. I want to look at an approach that works nicely for formal power series, following from the Haskell code I wrote here. There are some methods for directly finding "functional square roots" for formal power series that start as $z a_2z^2 a_3z^3 \ldots$, but I want to approach the problem indirectly. When working with real numbers we can find square roots, say, by using $\sqrt{x}=\exp(\frac{1}{2}\log{x})$. I want to use an analogue of this for functions. So my goal is to make sense of the idea of the logarithm and exponential of a formal power series as composable functions. Warning: the arguments are all going to be informal.

Notation

There's potential for a lot of ambiguous notation here, especially as the usual mathematical notation for $n$th powers of trig functions is so misleading. I'm going to use $\circ$ for composition of functions and power series, and I'm going to use the notation $f^{\circ n}$ to mean the $n$th iterate of $f$. So $f^{n 1}(x) = f(x)f^n(x)$ and $f^{\circ n 1}(x) = f(f^{\circ n}(x))$. As I'll be working mostly in the ring of formal power series $R[\![z]\!]$ for some ring $R$, I'll reserve the variable $z$ to refer only to the corresponding element in this ring. I'll also use formal power series somewhat interchangeably with functions. So $z$ can be thought of as representing the identity function. To make sure we're on the same page, here are some small theorems in this notation:

$z^mz^n = z^{m n}$
$f^{\circ m}\circ f^{\circ n} = f^{\circ m n}$
$(1 z)^n = \sum_{i=0}^n{n\choose i}z^n$
$(1 z)^{\circ n}=n z$.

That last one simply says that adding one $n$ times is the same as adding $n$.

As I'm going to have ordinary logarithms and exponentials sitting around, as well as functional logarithms and exponentials, I'm going to introduce the notation $\operatorname{LOG}$ for functional logarithm and $\operatorname{EXP}$ for functional exponentiation.

Preliminaries

The first goal is to define a non-trivial function $\operatorname{LOG}$ with the fundamental property that $\operatorname{LOG}(f^{\circ n})=n\operatorname{LOG}(f)$

First, let's note some basic algebraic facts. The formal power series form a commutative ring with operations $+$ and $\cdot$ (ordinary multiplication) and with additive identity $0$ and multiplicative identity $1$. The formal power series form a ring-like algebraic structure with operation $+$ and partial operation $\circ$ with additive identity $0$ and multiplicative identity $z$. But it's not actually ring or even a near-ring. Composition isn't defined for all formal power series and even when it's defined, we don't have distributivity. For example, in general $f\circ(g h)\ne f\circ g f\circ h$, after all there's no reason to expect $f(g(x) h(x))$ to equal $f(g(x)) f(h(x))$. We do have right-distributivity however, i.e.

$(f g)\circ h = f\circ g f\circ h$,

because

$(f g)(h(x))=f(h(x)) g(h(x))$,

more or less by definition of $+$ .

We can't use power series on our power series

There's an obvious approach, just use power series of power series. So we might tentatively suggest that

$\operatorname{LOG}(z f) = f-\frac{1}{2}f^{\circ 2} \frac{1}{3}f^{\circ 3} \ldots$.

Note that I consider $\operatorname{LOG}(z f)$ rather than $\operatorname{LOG}(1 f)$ because $z$ is the multiplicative identity in our ring-like structure.

Unfortunately this doesn't work. The reason is this: if we try to use standard reasoning to show that the resulting function has the fundamental property we seek we end up using distributivity. We don't have distributivity.

Sleight of hand

There's a beautiful trick I spotted on mathoverflow recently that allows us to bring back distributivity. (I can't find the trick again, but when I do I'll come back and add a link and credit here.) Consider the function $R(g)$ defined by $R(g)(f) = f\circ g$. In other words $R(g)$ is right-composition by $g$. (Ambiguity alert, I'm using $R$ here to mean right. It has nothing to do with the ring underlying our formal power series.) Because we have right-distributivity, $R(g)$ is a bona fide linear operator on the space of formal power series. If you think of formal power series as being infinitely long vectors of coefficients then $R(g)$ can be thought of as an infinitely sized matrix. This means that as long as we have convergence, we can get away with using power series to compute $\log R(g)$ with the property that $\log(R(g)^n) = n\log R(g)$. Define:

$\operator{LOG}(f) = \log(R(f))z$.

We have:

$\operator{LOG}(f) = \log(R(f))z = \log(1 (R(f)-1))z$

where I'm using $1$ to mean the identity linear operator. And now have:

$\operator{LOG}(f) = (R(f)-1)z-\frac{1}{2}(R(f)-1)^2z \frac{1}{3}(R(f)-1)^3z \ldots$.

But does it converge? Suppose $f$ is of the form $x a_2x^2 a_3x^3 \ldots$. Then $(R(f)-1)g = g\circ f-g$. The leading term in $g\circ f$ is the same as the leading term in $g$. So $R(f)-1$ kills the first term of whatever it is applied to, which means that when we sum the terms in $\operatorname{LOG}(f)$, we only need $n$ to get a power series correct to $n$ coefficients. Reusing my code from here, I call $\operatorname{LOG}$ by the name flog. Here is its implementation:

> import Data.Ratio


> flog :: (Eq a, Fractional a) => [a] -> [a]
> flog f@(0 : 1 : _) =
>   flog' 1 (repeat 0) (0 : 1 : repeat 0)
>      where flog' n total term = take (n+1) total ++ (
>              drop (n+1) $
>                 let pz = p term
>                 in flog' (n+1) (total-map (((-1)^n / fromIntegral n) *) pz) pz)
>            p total = (total ○ f) - total

The take and drop are how I tell Haskell when the first $n 1$ coefficients have been exactly computed and so no more terms are necessary.

Does it work?

Here's an example using the twice iterated sin function:

> ex1 = do
>   let lhs = flog (sin (sin z))
>   let rhs = 2*flog (sin z)
>   mapM_ print $ take 20 (lhs-rhs)

Works to 20 coefficients. Dare we try an inverse function?

> ex2 = do
>   let lhs = flog (sin z)
>   let rhs = flog (asin z)
>   mapM_ print $ take 20 (lhs+rhs)

Seems to work!

Exponentials

It's no good having logarithms if we can't invert them. One way to think about the exponential function is that

$\exp(x) = \lim_{n\rightarrow \infty}(1 \frac{x}{n})^n$

We get better and better approximations by writing the expression inside the limit as a product of more and more terms. We can derive the usual power series for $\exp$ from this, but only if right-distributivity holds. So let's try to use the above expression directly:

$\operatorname{EXP}(f) = \lim_{n\rightarrow \infty}(z \frac{f}{n})^{\circ n}$

and get

$\operatorname{EXP}(f) = \lim_{n\rightarrow \infty}R(z \frac{f}{n})^nz$.

Unfortunately, even though $R(g)$ is linear, $R$ itself isn't. So it's going to take some extra work to raise $R(z f/n)$ to the power of $n$.

The good news is that we're dealing with the special case $R(z \epsilon)$ where $\epsilon$ is something small. We have

$R(z \epsilon)f=f(z \epsilon)=f(z) \epsilon\frac{df}{dz} O(\epsilon^2)$.

So $R(z f/n)$ is actually $1 \frac{1}{n}f\frac{d}{dz}$ modulo higher order terms. This gives us

$\operatorname{EXP}(f) = \lim_{n\rightarrow \infty}(1 \frac{1}{n}f\frac{d}{dz})^nz=\exp(f\frac{d}{dz})z$.

This is something we can implement using the power series for ordinary $\exp$:

$\operatorname{EXP}(f) = z f \frac{1}{2!}f\frac{df}{dz} \frac{1}{3!}f\frac{d}{dz}(f\frac{df}{dz}) \ldots$.

In code that becomes:

> fexp f@(0 : 0 : _) = fexp' f 0 z 1
> fexp' f total term n = take (n-1) total ++ drop (n-1)
>           (fexp' f (total+term) (map (/fromIntegral n) (f*d term)) (n+1))

Note how when we differentiate a power series we shift the coefficients down by one place. To counter the effect of that so as to ensure convergence we need $f$ to look like $a_2z^2 a_3a^3 \ldots$. Luckily this is exactly the kind of series $\operatorname{LOG}$ gives us.

But does it successfully invert $\operatorname{LOG}$? Let's try:

> ex3 = do
>   let lhs = sin z
>   let rhs = fexp (flog (sin z))
>   mapM_ print $ take 20 (lhs-rhs)

Now we can start computing fractional iterates. Square root first:

> ex4 = do
>   mapM_ print $ take 20 $ fexp (flog (sin z)/2)

That matches the results at A048602 and A048603.

Cube root:

> ex5 = do
>   mapM_ print $ take 20 $ fexp (flog (sin z)/3)

Matches A052132 and A052135.

And this gives an alternative to Lagrange inversion for computing power series for inverse functions:

> ex6 = do
>   let lhs = fexp (-flog (sin z))
>   let rhs = asin z
>   mapM_ print $ take 20 (lhs-rhs)

What's really going on with $\operatorname{EXP}$?

Let's approach $\operatorname{EXP}$ in a slightly different way. In effect, $\operatorname{EXP}$ is the composition of $n$ lots of $z \frac{f}{n}$ with $z$. So let's try composing these one at a time, with one composition every $\frac{1}{n}$ seconds. After one second we should have our final result. We can write this as:

$g(0) = z$ and $g(t \frac{1}{n}) = g(t) \frac{1}{n}f(g(t))$ to first order.

So we're solving the differential equation:

$g(0) = z$ and $\frac{dg}{dt} = f(g(t))$

with $\operatorname{EXP}(g) = g(1)$.

So $\operatorname{EXP}$ is the function that solves one of the most fundamental differential equations. This also means I can use Mathematica to solve symbolically and check my results. For example, Mathematica says that the solution to

$\frac{dg}{dt}=sin(g(t))^2$ and $g(0)=x$

at $t=1$ is

$g(1) = \frac{\tan z}{1-\tan z}$

so let's check:

> ex7 = do
>   let lhs = fexp ((sin z)^2)
>   let rhs = atan (tan z/(1-tan z))
>   mapM_ print $ take 20 (lhs-rhs)

I like this example because it leads to the generalized Catalan numbers A004148:

> ex8 = do
>     mapM_ print $ take 20 $ fexp (z^2/(1-z^2))

That suggests this question: what does $\operatorname{EXP}$ mean combinatorially? I don't have a straightforward answer but solving this class of differential equation motivated the original introduction, by Cayley, of the abstract notion of a tree. See here.

What is going on geometrically?

For those who know some differential geometry, The differential equation

$g(0) = z$ and $\frac{dg}{dt} = f(g(t))$

describes a flow on the real line (or complex plane). You can think of $f$ as being a one-dimensional vector field describing how points move from time $t$ to $t dt$. When we solve the differential equation we get integral curves that these points follow and $\operatorname{EXP}$ tells us where the points end up after one unit of time. So $\operatorname{EXP}$ is the exponential map. In fact, $\operatorname{EXP}(f)=\exp(f\frac{d}{dz})z$ is essentially the exponential of the vector field $f\frac{d}{dz}$ where we're now using the differential geometer's notion of a vector field as a differential operator.

Final word

Unfortunately the power series you get from using $\operator{LOG}$ and $\operator{EXP}$ don't always have good convergence properties. For example, I'm not sure but I think the series for $\sin^{\circ 1/2} z$ has radius of convergence zero. If you truncate the series you get a half-decent approximaion to a square root in the vicinity of the origin, but the approximation gets worse, not better, if you use more terms.

And the rest of the code

> (*!) _ 0 = 0
> (*!) a b = a*b
> (!*) 0 _ = 0
> (!*) a b = a*b
> (^+) a b = zipWith (+) a b
> (^-) a b = zipWith (-) a b


> ~(a:as) ⊗ (b:bs) = (a *! b):
>     ((map (a !*) bs) ^+ (as ⊗ (b:bs)))
> (○) (f:fs) (0:gs) = f:(gs ⊗ (fs ○ (0:gs)))
> inverse (0:f:fs) = x where x     = map (recip f *) (0:1:g)
>                            _:_:g    = map negate ((0:0:fs) ○ x)
> invert x = r where r = map (/x0)  ((1:repeat 0) ^- (r ⊗ (0:xs)))
>                    x0:xs = x 


> (^/) (0:a) (0:b) = a ^/ b
> (^/) a b = a ⊗ (invert b)


> z :: [Rational]
> z = 0:1:repeat 0


> d (_:x) = zipWith (*) (map fromInteger [1..]) x


> integrate x = 0 : zipWith (/) x (map fromInteger [1..])


> instance (Eq r, Num r) => Num [r] where
>     x+y  = zipWith (+) x y
>     x-y  = zipWith (-) x y
>     ~x*y = x ⊗ y
>     fromInteger x      = fromInteger x:repeat 0
>     negate x     = map negate x
>     signum (x:_) = signum x : repeat 0
>     abs (x:xs)   = error "Can't form abs of a power series"


> instance (Eq r, Fractional r) => Fractional [r] where
>     x/y = x ^/ y
>     fromRational x    = fromRational x:repeat 0


> sqrt' x = 1 : rs where rs = map (/2) (xs ^- (rs ⊗ (0:rs)))
>                        _ : xs = x
> instance (Eq r, Fractional r) => Floating [r] where
>     sqrt (1 : x) = sqrt' (1 : x)
>     sqrt _  = error "Can only find sqrt when leading term is 1"
>     exp x   = e where e = 1+integrate (e * d x)
>     log x   = integrate (d x/x)
>     sin x   = integrate ((cos x)*(d x))
>     cos x   = [1] ... negate (integrate ((sin x)*(d x)))
>     asin x  = integrate (d x/sqrt(1-x*x))
>     atan x  = integrate (d x/(1+x*x))
>     acos x  = error "Unable to form power series for acos"
>     sinh x  = integrate ((cosh x)*(d x))
>     cosh x  = [1] ... integrate ((sinh x)*(d x))
>     asinh x = integrate (d x/sqrt(1+x*x))
>     atanh x = integrate (d x/(1-x*x))
>     acosh x = error "Unable to form power series for acosh"
>     pi      = error "There is no formal power series for pi"


> lead [] x = x
> lead (a:as) x = a : (lead as (tail x))
> a ... x = lead a x


> (//) :: Fractional a => [a] -> (Integer -> Bool) -> [a]
> (//) a c = zipWith (\a-> \b->(if (c a :: Bool) then b else 0)) [(0::Integer)..] a

A direct functional square root that doesn't use $\operatorname{LOG}$ and $\operatorname{EXP}$:

> fsqrt (0 : 1 : fs) =
>     let gs = (fs-(0 : gs*((0 : delta gs gs)+((2 : gs)*(gs*g)))))/2
>         g = 0 : 1 : gs
>         delta (g : gs) h = let g' = delta gs h
>                    in (0 : ((1 : h) * g')) + gs
>     in g

tag:blogger.com,1999:blog-11295132.post-8266076036196212490

Extensions

Building free arrows from components

sigfpe Jan 9, 2017 Updated Jan 9, 2017

Show full content

Introduction

Gabriel Gonzalez has written quite a bit about the practical applications of free monads. And "haoformayor" wrote a great stackoverflow post on how arrows are related to strong profunctors. So I thought I'd combine these and apply them to arrows built from profunctors: free arrows. What you get is a way to use arrow notation to build programs, but defer the interpretation of those programs until later.

Heteromorphisms

Using the notation here I'm going to call an element of a type P a b, where P is a profunctor, a heteromorphism.

A product that isn't much of a product

As I described a while back you can compose profunctors. Take a look at the code I used, and also Data.Functor.Composition.

data Compose f g d c = forall a. Compose (f d a) (g a c)

An element of Compose f g d c is just a pair of heteromorphisms, one from each of the profunctors, f and g, with the proviso that the "output" type of one is compatible with the "input" type of the other. As products go it's pretty weak in the sense that no composition happens beyond the two objects being stored with each other. And that's the basis of what I'm going to talk about. The Compose type is just a placeholder for pairs of heteromorphisms whose actual "multiplication" is being deferred until later. This is similar to the situation with the free monoid, otherwise known as a list. We can "multiply" two lists together using mappend but all that really does is combine the elements into a bigger list. The elements themselves aren't touched in any way. That suggests the idea of using profunctor composition in the same way that (:) is used to pair elements and lists.

Free Arrows

Here's some code:

> {-# OPTIONS -W #-}
> {-# LANGUAGE ExistentialQuantification #-}
> {-# LANGUAGE Arrows #-}
> {-# LANGUAGE RankNTypes #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE FlexibleInstances #-}


> import Prelude hiding ((.), id)
> import Control.Arrow
> import Control.Category
> import Data.Profunctor
> import Data.Monoid


> infixr :-


> data FreeA p a b = PureP (a -> b)
>                  | forall x. p a x :- FreeA p x b

First look at the second line of the definition of FreeA. It says that a FreeA p a b might be a pair consisting of a head heteromorphism whose output matches the input of another FreeA. There's also the PureP case which is acting like the empty list []. The reason we use this is that for our composition, (->) acts a lot like the identity. In particular Composition (->) p a b is isomorphic to p a b (modulo all the usual stuff about non-terminating computations and so on). This is because an element of this type is a pair consisting of a function a -> x and a heteromorphism p x b for some type x we don't get to see. We can't project back out either of these items without information about the type of x escaping. So the only thing we can possibly do is use lmap to apply the function to the heteromorphism giving us an element of p a b.

Here is a special case of PureP we'll use later:

> nil :: Profunctor p => FreeA p a a
> nil = PureP id

So an element of FreeA is a sequence of heteromorphisms. If heteromorphisms are thought of as operations of some sort, then an element of FreeA is a sequence of operations waiting to be composed together into a program that does something. And that's just like the situation with free monads. Once we've build a free monad structure we apply an interpreter to it to evaluate it. This allows us to separate the "pure" structure representing what we want to do from the code that actually does it.

The first thing to note is our new type is also a profunctor. We can apply lmap and rmap to a PureP function straightforwardly. We apply lmap directly to the head of the list and we use recursion to apply rmap to the PureP at the end:

> instance Profunctor b => Profunctor (FreeA b) where
>     lmap f (PureP g) = PureP (g . f)
>     lmap f (g :- h) = (lmap f g) :- h
>     rmap f (PureP g) = PureP (f . g)
>     rmap f (g :- h) = g :- (rmap f h)

We also get a strong profunctor by applying first' all the way down the list:

> instance Strong p => Strong (FreeA p) where
>     first' (PureP f) = PureP (first' f)
>     first' (f :- g) = (first' f) :- (first' g)

We can now concatenate our lists of heteromorphisms using code that looks a lot like the typical implementation of (++):

> instance Profunctor p => Category (FreeA p) where
>     id = PureP id
>     g . PureP f = lmap f g
>     k . (g :- h) = g :- (k . h)

Note that it's slightly different to what you might have expected compared to (++) because we tend to write composition of functions "backwards". Additionally, there is another definition of FreeA we could have used that's analogous to using snoc lists instead of cons lists.

And now we have an arrow. I'll leave the proofs that the arrow laws are obeyed as an exercise :-)

> instance (Profunctor p, Strong p) => Arrow (FreeA p) where
>     arr = PureP
>     first = first'

The important thing about free things is that we can apply interpreters to them. For lists we have folds:

foldr :: (a -> b -> b) -> b -> [a] -> b

In foldr f e we can think of f as saying how (:) should be interpreted and e as saying how [] should be interpreted.

Analogously, in Control.Monad.Free in the free package we have:

foldFree :: Monad m => (forall x . f x -> m x) -> Free f a -> m a
foldFree _ (Pure a)  = return a
foldFree f (Free as) = f as >>= foldFree f

Given a natural transformation from f to m, foldFree extends it to all of Free f.

Now we need a fold for free arrows:

> foldFreeA :: (Profunctor p, Arrow a) =>
>              (forall b c.p b c -> a b c) -> FreeA p b c -> a b c
> foldFreeA _ (PureP g) = arr g
> foldFreeA f (g :- h) = foldFreeA f h . f g

It's a lot like an ordinary fold but uses the arrow composition law to combine the interpretation of the head with the interpretation of the tail.

"Electronic" components

Let me revisit the example from my previous article. I'm going to remove things I won't need so my definition of Circuit is less general here. Free arrows are going to allow us to define individual components for a circuit, but defer exactly how those components are interpreted until later.

I'll use four components this time: a register we can read from, one we can write from and a register incrementer, as well as a "pure" component. But before that, let's revisit Gabriel's article that gives some clues about how components should be built. In particular, look at the definition of TeletypeF:

data TeletypeF x
  = PutStrLn String x
  | GetLine (String -> x)
  | ExitSuccess

We use GetLine to read a string, and yet the type of GetLine k could be TeletypeF a for any a. The reason is that free monads work with continuations. Instead of GetLine returning a string to us, it's a holder for a function that says what we'd like to do with the string once we have it. That means we can leave open the question of where the string comes from. The function foldFree can be used to provide the actual string getter.

Free arrows are like "two-sided" free monads. We don't just provide a continuation saying what we'd like to do to our output. We also get to say how we prepare our data for input.

There's also some burden put on us. Free arrows need strong profunctors. Strong profunctors need to be able to convey extra data alongside the data we care about - that's what first' is all about. This means that even though Load is functionally similar to GetLine, it can't simply ignore its input. So we don't have Load (Int -> b), and instead have Load ((a, Int) -> b. Here is our component type:

> data Component a b = Load ((a, Int) -> b)
>                    | Store (a -> (b, Int))
>                    | Inc (a -> b)

The Component only knows about the data passing through, of type a and b. It doesn't know anything about how the data in the registers is stored. That's the part that will be deferred to later. We intend for Inc to increment a register. But as it doesn't know anything about registers nothing in the type of Inc refers to that. (It took a bit of experimentation for me to figure this out and there may be other ways of doing things. Often with code guided by category theory you can just "follow your nose" as there's one way that works and type checks. Here I found a certain amount of flexibility in how much you store in the Component and how much is deferred to the interpreter.)

I could implement the strong profunctor instances using various combinators but I think it might be easiest to understand when written explicitly with lambdas:

> instance Profunctor Component where
>     lmap f (Load g) = Load $ \(a, s) -> g (f a, s)
>     lmap f (Store g) = Store (g . f)
>     lmap f (Inc g) = Inc (g . f)


>     rmap f (Load g) = Load (f . g)
>     rmap f (Store g) = Store $ \a -> let (b, t) = g a
>                                      in  (f b, t)
>     rmap f (Inc g) = Inc (f . g)


> instance Strong Component where
>     first' (Load g) = Load $ \((a, x), s) -> (g (a, s), x)
>     first' (Store g) = Store $ \(a, x) -> let (b, t) = g a
>                                           in  ((b, x), t)
>     first' (Inc g) = Inc (first' g)

And now we can implement individual components. First a completely "pure" component:

> add :: Num a => FreeA Component (a, a) a
> add = PureP $ uncurry (+)

And now the load and store operations.

> load :: FreeA Component () Int
> load = Load (\(_, a) -> a) :- nil


> store :: FreeA Component Int ()
> store = Store (\a -> ((), a)) :- nil


> inc :: FreeA Component a a
> inc = Inc id :- nil

Finally we can tie it all together in a complete function using arrow notation:

> test = proc () -> do
>     () <- inc   -< ()
>     a  <- load  -< ()
>     b  <- load  -< ()
>     c  <- add   -< (a, b)
>     () <- store -< c


>     returnA -< ()

At this point, the test object is just a list of operations waiting to be executed. Now I'll give three examples of semantics we could provide. The first uses a state arrow type similar to the previous article:

> newtype Circuit s a b = C { runC :: (a, s) -> (b, s) }


> instance Category (Circuit s) where
>     id = C id
>     C f . C g = C (f . g)


> instance Arrow (Circuit s) where
>     arr f = C $ \(a, s) -> (f a, s)
>     first (C g) = C $ \((a, x), s) -> let (b, t) = g (a, s)
>                                       in  ((b, x), t)

Here is an interpreter that interprets each of our components as an arrow. Note that this is where, among other things, we provide the meaning of the Inc operation:

> exec :: Component a b -> Circuit Int a b
> exec (Load g) = C $ \(a, s) -> (g (a, s), s)
> exec (Store g) = C $ \(a, _) -> g a
> exec (Inc g) = C $ \(a, s) -> (g a, s+1)

Here's a completely different interpreter that is going to make you do the work of maintaining the state used by the resgisters. You'll be told what to do! We'll use the Kleisli IO arrow to do the I/O.

> exec' :: Component a b -> Kleisli IO a b
> exec' (Load g) = Kleisli $ \a -> do
>     putStrLn "What is your number now?"
>     s <- fmap read getLine
>     return $ g (a, s)
> exec' (Store g) = Kleisli $ \a -> do
>     let (b, t) = g a
>     putStrLn $ "Your number is now " ++ show t ++ "."
>     return b
> exec' (Inc g) = Kleisli $ \a -> do
>     putStrLn "Increment your number."
>     return $ g a

The last interpreter is simply going to sum values associated to various components. They could be costs in dollars, time to execute, or even strings representing some kind of simple execution trace.

> newtype Labelled m a b = Labelled { unLabelled :: m }


> instance Monoid m => Category (Labelled m) where
>     id = Labelled mempty
>     Labelled a . Labelled b = Labelled (a `mappend` b)


> instance Monoid m => Arrow (Labelled m) where
>     arr _ = Labelled mempty
>     first (Labelled m) = Labelled m


> exec'' (Load _) = Labelled (Sum 1)
> exec'' (Store _) = Labelled (Sum 1)
> exec'' (Inc _) = Labelled (Sum 2)

Note that we can't assign non-trivial values to "pure" operations.

And now we execute all three:

> main = do
>     print $ runC (foldFreeA exec test) ((), 10)
>     putStrLn "Your number is 10." >> runKleisli (foldFreeA exec' test) ()
>     print $ getSum $ unLabelled $ foldFreeA exec'' test

Various thoughts

I don't know if free arrows are anywhere near as useful as free monads, but I hope I've successfully illustrated one application. Note that because arrow composition is essentially list concatenation it may be more efficient to use a version of Hughes lists. This is what the Cayley representation is about in the monoid notions paper. But it's easier to see the naive list version first. Something missing from here that is essential for electronics simulation is the possibility of using loops. I haven't yet thought too much about what it means to build instances of ArrowLoop freely.

Profunctors have been described as decategorised matrices in the sense that p a b, with p a profunctor, is similar to the matrix $P_{ab}$ . Or, if you're working in a context where you distinguish between co- and contravariant vectors, it's similar to $P_a^b$ . The Composition operation is a lot like the definition of matrix product. From this perspective, the FreeA operation is a lot like the function on matrices that takes $P$ to $(1-P)^{-1}$ . To work with ArrowLoop we need a trace-like operation.

One nice application of free monads is in writing plugin APIs. Users can write plugins that link to a small library based on a free monad. These can then be dynamically loaded and interpreted by an application at runtime, completely insulating the plugin-writer from the details of the application. You can think of it as a Haskell version of the PIMPL idiom. Free arrows might give a nice way to write plugins for dataflow applications.

People typically think of functors as containers. So in a free monad, each element is a container of possible futures. In a free arrow the relationship between the current heteromorphism and its "future" (and "past") is a bit more symmetrical. For example, for some definitions of P, a heteromorphism P a b can act on some as to give us some bs. But some definitions of P can run "backwards" and act on elements of b -> r to give us elements of a -> r. So when I use the words "input" and "output" above, you might not want to take them too literally.

tag:blogger.com,1999:blog-11295132.post-8137439008294368764

Extensions

Addressing Pieces of State with Profunctors

sigfpe Jan 7, 2017 Updated Jan 7, 2017

Show full content

Attempted segue

Since I first wrote about profunctors there has been quite a bit of activity in the area so I think it's about time I revisited them. I could just carry on from where I left off 5 years ago but there have been so many tutorials on the subject that I think I'll have to assume you've looked at them. My favourite is probably Phil Freeman's Fun with Profunctors. What I intend to do here is solve a practical problem with profunctors.

The problem

Arrows are a nice mechanism for building circuit-like entities in code. In fact, they're quite good for simulating electronic circuits. Many circuits are very much like pieces of functional code. For example an AND gate like this

can be nicely modelled using a pure function: c = a && b. But some components, like flip-flops, have internal state. What comes out of the outputs isn't a simple function of the inputs right now, but depends on what has happened in the past. (Alternatively you can take the view that the inputs and outputs aren't the current values but the complete history of the values.)

We'll use (Hughes) arrows rather than simple functions. For example, one kind of arrow is the Kleisli arrow. For the case of Kleisli arrows built from the state monad, these are essentially functions of type a -> s -> (b, s) where s is our state. We can write these more symmetrically as functions of type (a, s) -> (b, s). We can think of these as "functions" from a to b where the output is allowed to depend on some internal state s. I'll just go ahead and define arrows like this right now.

First the extensions and imports:

> {-# OPTIONS -W #-}
> {-# LANGUAGE Arrows #-}
> {-# LANGUAGE RankNTypes #-}
> {-# LANGUAGE FlexibleInstances #-}


> import Prelude hiding ((.), id)
> import Control.Arrow
> import Control.Category
> import Data.Profunctor
> import Data.Tuple

And now I'll define our stateful circuits. I'm going to make these slightly more general than I described allowing circuits to change the type of their state:

> newtype Circuit s t a b = C { runC :: (a, s) -> (b, t) }


> instance Category (Circuit s s) where
>     id = C id
>     C f . C g = C (f . g)


> instance Arrow (Circuit s s) where
>     arr f = C $ \(a, s) -> (f a, s)
>     first (C g) = C $ \((a, x), s) -> let (b, t) = g (a, s)
>                                       in  ((b, x), t)

This is just a more symmetrical rewrite of the state monad as an arrow. The first method allows us to pass through some extra state, x, untouched.

Now for some circuit components. First the "pure" operations, a multiplier and a negater:

> mul :: Circuit s s (Int, Int) Int
> mul = C $ \((x, y), s) -> (x*y, s)


> neg :: Circuit s s Int Int
> neg = C $ \(x, s) -> (-x, s)

And now some "impure" ones that read and write some registers as well as an accumulator:

> store :: Circuit Int Int Int ()
> store = C $ \(x, _) -> ((), x)


> load :: Circuit Int Int () Int
> load = C $ \((), s) -> (s, s)


> accumulate :: Circuit Int Int Int Int
> accumulate = C $ \(a, s) -> (a, s+a)

I'd like to make a circuit that has lots of these components, each with its own state. I'd like to store all of these bits of state in a larger container. But that means that each of these components needs to have a way to address its own particular substate. That's the problem I'd like to solve.

Practical profunctor optics

In an alternative universe lenses were defined using profunctors. To find out more I recommend Phil Freeman's talk that I linked to above. Most of the next paragraph is just a reminder of what he says in that talk and I'm going to use the bare minimum to do the job I want.

Remember that one of the things lenses allow you to do is this: suppose we have a record s containing a field of type a and another similar enough kind of record t with a field of type b. Among other things, a lens gives a way to take a rule for modifying the a field to a b field and extend it to a way to modify the s record into a t record. So we can think of lenses as giving us functions of type (a -> b) -> (s -> t). Now if p is a profunctor then you can think of p a b as being a bit function-like. Like functions, profunctors typically (kinda, sorta) get used to consume (zero or more) objects of type a and output (zero or more) objects of type b. So it makes sense to ask our lenses to work with these more general objects too, i.e. we'd like to be able to get something of type p a b -> p s t out of a lens. A strong profunctor is one that comes pre-packed with a lens that can do this for the special case where the types s and t are 2-tuples. But you can think of simple records as being syntactic sugar for tuples of fields, so strong profunctors also automatically give us lenses for records. Again, watch Phil's talk for details.

So here is our lens type:

> type Lens s t a b = forall p. Strong p => p a b -> p s t

Here are lenses that mimic the well known ones from Control.Lens:

> _1 :: Lens (a, x) (b, x) a b
> _1 = first'


> _2 :: Lens (x, a) (x, b) a b
> _2 = dimap swap swap . first'

(Remember that dimap is a function to pre- and post- compose a function with two others.)

Arrows are profunctors. So Circuit s s, when wrapped in WrappedArrow, is a profunctor. So now we can directly use the Circuit type with profunctor lenses. This is cool, but it doesn't directly solve our problem. So we're not going to use this fact. We're interested in addressing the state of type s, not the values of type a and b passed through our circuits. In other words, we're interested in the fact that Circuit s t a b is a profunctor in s and t, not a and b. To make this explicit we need a suitable way to permute the arguments to Circuit:

> newtype Flipped p s t a b = F { unF :: p a b s t }

(It was tempting to call that ComedyDoubleAct.)

And now we can define:

> instance Profunctor (Flipped Circuit a b) where
>     lmap f (F (C g)) = F $ C $ \(a, s) -> g (a, f s)
>     rmap f (F (C g)) = F $ C $ \(a, s) -> let (b, t) = g (a, s)
>                                           in  (b, f t)


> instance Strong (Flipped Circuit a b) where
>     first' (F (C g)) = F $ C $ \(a, (s, x)) -> let (b, t) = g (a, s)
>                                                in  (b, (t, x))

Any time we want to use this instance of Profunctor with a Circuit we have to wrap everything with F and unF. The function dimap gives us a convenient way to implement such wrappings.

Let's implement an imaginary circuit with four bits of state in it.

Here is the state:

> data CPU = CPU { _x :: Int, _y :: Int, _z :: Int, _t :: Int } deriving Show

As I don't have a complete profunctor version of a library like Control.Lens with its template Haskell magic I'll set things up by hand. Here's a strong-profunctor-friendly version of the CPU and a useful isomorphism to go with it:

> type ExplodedCPU = (Int, (Int, (Int, Int)))


> explode :: CPU -> ExplodedCPU
> explode (CPU u v w t) = (u, (v, (w, t)))


> implode :: ExplodedCPU -> CPU
> implode (u, (v, (w, t))) = CPU u v w t

And now we need adapters that take lenses for an ExplodedCPU and (1) apply them to a CPU the way Control.Lens would...

> upgrade :: Profunctor p =>
>            (p a a -> p ExplodedCPU ExplodedCPU) ->
>            (p a a -> p CPU CPU)
> upgrade f = dimap explode implode . f


> x, y, z, t :: Flipped Circuit a b Int Int -> Flipped Circuit a b CPU CPU
> x = upgrade _1
> y = upgrade $ _2 . _1
> z = upgrade $ _2 . _2 . _1
> t = upgrade $ _2 . _2 . _2

...and (2) wrap them so they can be used on the flipped profunctor instance of Circuit:

> (!) :: p s t a b -> (Flipped p a b s t -> Flipped p a b s' t') ->
>        p s' t' a b
> x ! f = dimap F unF f x

After all that we can now write a short piece of code that represents our circuit. Notice how we can apply the lenses x, ..., t directly to our components to get them to use the right pieces of state:

> test :: Circuit CPU CPU () ()
> test = proc () -> do
>     a  <- load ! x       -< ()
>     b  <- load ! y       -< ()
>     c  <- mul            -< (a, b)
>     d  <- neg            -< c
>     e  <- accumulate ! t -< d
>     () <- store ! z      -< e


>     returnA              -< ()


> main :: IO ()
> main = do
>     print $ runC test ((), CPU 2 30 400 5000)

Of course with a suitable profunctor lens library you can do a lot more, like work with traversable containers of components.

Note that we could also write a version of all this code using monads instead of arrows. But it's easier to see the symmetry in Flipped Circuit when using arrows, and it also sets the scene for the next thing I want to write about...

tag:blogger.com,1999:blog-11295132.post-1274672457856853383

Extensions

Expectation-Maximization with Less Arbitrariness

sigfpe Oct 16, 2016 Updated Apr 22, 2026

Show full content

Introduction

Google have stopped supporting the Chart API so all of the mathematics notation below is missing. There is a PDF version of this article at GitHub.

There are many introductions to the Expectation-Maximisation algorithm. Unfortunately every one I could find uses arbitrary seeming tricks that seem to be plucked out of a hat by magic. They can all be justified in retrospect, but I find it more useful to learn from reusable techniques that you can apply to further problems. Examples of tricks I've seen used are:

Using Jensen's inequality. It's easy to find inequalities that apply in any situation. But there are often many ways to apply them. Why apply it to this way of writing this expression and not that one which is equal?
Substituting $1=A/A$ in the middle of an expression. Again, you can use $1=A/A$ just about anywhere. Why choose this $A$ at this time? Similarly I found derivations that insert a $B-B$ into an expression.
Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.

My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary. There's nothing original here, I'm merely expanding on a stackexchange answer.

Generalities about EM

The EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe.

First assume no hidden variables. We assume there is a vector of parameters $\theta=(\theta_i)$ that defines some model. We make some observations $x=(x_j)$. We have a probability density $P(x|\theta)$ that depends on $\theta$. The likelihood of $\theta$ given the observations $x$ is $l(\theta|x)=P(x|\theta)$. The maximum likelhood estimator for $\theta$ is the choice of $\theta$ that maximises $l(\theta|x)$ for the $x$ we have observed.

Now suppose there are also some variables $z=(z_k)$ that we didn't get to observe. We assume a density $P(x,z|\theta)$. We now have

$P(x|\theta)=\sum_z P(x,z|\theta)$

where we sum over all possible values of $z$. The MLE approach says we now need to maximise

$l(\theta|x)=\sum_z P(x,z|\theta).$

One of the things that is a challenge here is that the components of $\theta$ might be mixed up among the terms in the sum. If, instead, each term only referred to its own unique block of $\theta_i$, then the maximisation would be easier as we could maximise each term independently of the others. Here's how we might move in that direction. Consider instead the log-likelihood

$\log l(\theta|x)=\log\sum_z P(x,z|\theta).$

Now imagine that by magic we could commute the logarithm with the sum. We'd need to maximise

$\sum_z \log P(x,z|\theta).$

One reason this would be to our advantage is that $P(x,z|\theta)$ often takes the form $\exp(f(x,z,\theta))$ where $f$ is a simple function to optimise. In addition, $f$ may break up as a sum of terms, each with its own block of $\theta_i$'s. Moving the logarithm inside the sum would give us something we could easily maximise term by term. What's more, the $P(x,z|\theta)$ for each $z$ is often a standard probability distribution whose likelihood we already know how to maximise. But, of course, we can't just move that logarithm in.

Maximisation by proxy

Sometimes a function is too hard to optimise directly. But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead. That will give us a new guess and we can continue from there. This is the basis of gradient descent. Suppose $f$ is a differentiable function in a neighbourhood of $x_0$. Then around $x_0$ we have

$f(x) \approx f(x_0) f'(x_0)\cdot (x-x_0).$

We can try optimising $f(x_0) f'(x_0)\cdot (x-x_0)$ with respect to $x$ within a neighbourhood of $x_0$. If we pick a small circular neighbourhood then the optimal value will be in the direction of steepest descent. (Note that picking a circular neighbourhood is itself a somewhat arbitrary step, but that's another story.) For gradient descent we're choosing $f(x_0) f'(x_0)\cdot (x-x_0)$ because it matches both the value and derivatives of $f$ at $x_0$. We could go further and optimise a proxy that shares second derivatives too, and that leads to methods based on Newton-Raphson iteration.

We want our logarithm of a sum to be a sum of logarithms. But instead we'll settle for a proxy function that is a sum of logarithms. We'll make the derivatives of the proxy match those of the original function precisely so we're not making an arbitrary choice.

Write

$\log l(\theta|x) = \log\sum_z P(x,z|\theta) \approx \sum_z\beta_z\log P(x,z|\theta) \text{constant}.$

The $\beta_z$ are constants we'll determine. We want to match the derivatives on either side of the $\approx$ at $\theta=\theta_0$:

$\frac{\partial \log l(\theta_0|x)}{\partial\theta_0}$ $=\frac{1}{l(\theta_0|x)} \frac{\partial l(\theta_0|x)}{\partial\theta_0} =\sum_z\frac{1}{l(\theta_0|x)} \frac{\partial P(x,z|\theta_0)}{\partial\theta_0}.$

On the other hand we have

$\frac{\partial}{\partial\theta_0}\sum_z\beta_z\log P(x,z|\theta_0) =\sum_z\beta_z\frac{1}{P(x,z|\theta_0)}\frac{\partial P(x,z|\theta_0)}{\partial\theta_0}$

To achieve equality we want to make these expressions match. We choose

$\beta_z = \frac{P(x,z|\theta_0)}{l(\theta_0|x)} = \frac{P(x,z|\theta_0)}{P(x|\theta_0)} = P(z|x,\theta_0).$

Our desired proxy function is:

$\sum_z P(z|x,\theta_0)\log P(x,z|\theta) + \text{const.} = E_{Z|x,\theta_0}(\log P(x,Z|\theta)) + \text{const.}$

So the procedure is to take an estimated $\theta_0$ and obtain a new estimate by optimising this proxy function with respect to $\theta$. This is the standard EM algorithm.

It turns out that this proxy has some other useful properties. For example, because of the concavity of the logarithm, the proxy is always smaller than the original likelihood. This means that when we optimise it we never optimise ``too far'' and that progress optimising the proxy is always progress optimising the original likelihood. But I don't need to say anything about this as it's all part of the standard literature.

Afterword

As a side effect we have a general purpose optimisation algorithm that has nothing to do with statistics. If your goal is to compute

$\operatorname{argmax}_x\sum_i\exp(f_i(x))$

you can iterate, at each step computing

$\operatorname{argmax}_x\sum_i\exp(f_i(x_0))f_i(x)$

where $x_0$ is the previous iteration. If the $f_i$ take a convenient form then this may turn out to be much easier.

Note

This was originally written as a PDF using LaTeX. It'll be available here for a while. Some fidelity was lost when converting it to HTML.

tag:blogger.com,1999:blog-11295132.post-751712815454057762

Extensions

Dimensionful Matrices

sigfpe Aug 7, 2016 Updated Aug 7, 2016

Show full content

Introduction

Programming languages and libraries for numerical work tend not to place a lot of emphasis on the types of their data. For example Matlab, R, Octave, Fortran, and Numpy (but not the now defunct Fortress) all tend to treat their data as plain numbers meaning that any time you have a temperature and a mass, say, there is nothing to prevent you adding them.

I've been wondering how much dimensions (in the sense of dimensional analysis) and units could help with numerical programming. As I pointed out on G+ recently (which is where I post shorter stuff these days), you don't have to limit dimensions to the standard ones of length, mass, time, dollars and so on. Any scale invariance in the equations you're working with can be exploited as a dimension giving you a property that can be statically checked by a compiler.

There are quite a few libraries to statically check dimensions and units now. For example Boost.Units for C++, units for Haskell and even quantities for Idris.

A matrix that breaks things

Even if a language supports dimensions, it's typical to define objects like vectors and matrices as homogeneous containers of quantities. But have a look at the Wikipedia page on the metric tensor. There is a matrix

$\eta=\begin{pmatrix}-c^2&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix}$

which has the curious property that 3 entries on the diagonal seem to be dimensionless while the first entry is a squared velocity with dimension $L^2T^{-2}$ . This will break many libraries that support units. An obvious workaround is to switch to use natural units, which is much the same as abandoning the usefulness of dimensions. But there's another way, even if it may be tricky to set up with existing languages.

Heterogeneous vectors and matrices

According to a common convention in physics, a 4-vector $x=(t=x^0,x^1,x^2,x^3)$ has dimensions $[x]=(T, L, L, L)$ where I'm using the convention that we can represent the units of a vector or matrix simply as a vector or matrix of dimensions, and here $T$ is time and $L$ is length. The metric tensor is used like this: $ds^2=x^i\eta_{ij}x^j$ (where I'm using the Einstein summation convention so the $i$ 's and $j$ 's are summed over). If we think of $ds^2$ having units of length squared (it is a pseudo-Riemannian metric after all) then it makes sense to think of $\eta$ having dimensions given by

$[\eta]=\begin{pmatrix}L^2T^{-2}&LT^{-1}&LT^{-1}&LT^{-1}\\LT^{-1}&1&1&1\\LT^{-1}&1&1&1\\LT^{-1}&1&1&1\end{pmatrix}$

We can write this more succinctly as

$[\eta]=(LT^{-1},1,1,1)\otimes(LT^{-1},1,1,1)$

where $\otimes$ is the usual outer product.

I'll use the notation $a:A$ to mean $a$ is of type $A$ . So, for example, $(t,x^1,x^2,x^3):(T,L,L,L)$ . I'll also use pointwise notation for types such as $(A,B,C,D)*(E,F,G,H)=(AE,BF,CG,DH)$ and $(A,B,C)^{-1}=(A^{-1},B^{-1},C^{-1})$ .

Now I can give some general rules. If $m:M$ is a matrix, $x:X$ and $y:Y$ are vectors, and $s:S$ is a scalar, then $y=mx$ only makes sense if $M=Y\otimes X^{-1}$ . Similarly the "inner product" $x^Tmy=s$ only makes sense if $M=X^{-1}\otimes Y^{-1}S$ .

Generic vectors and matrices

Although these kinds of types might be useful if you're dealing with the kind of heterogeneous matrices that appear in relativity, there's another reason they might be useful. If you write code (in the imaginary language that supports these structures and understands dimensions and units) to be as generic as possible in the types of the vector and matrix entries, failures to type check will point out parts of the code where there are hidden assumptions, or even errors, about scaling. For example, consider a routine to find the inverse of a 3 by 3 matrix. Writing this generically as possible means we should write it to operate on a matrix of type $(A,B,C)\otimes(D,E,F)$ , say. The result should have type $(D,E,F)^{-1}\otimes(A,B,C)^{-1}$ . If this type checks when used with a suitably powerful type checker then it means that if we replace the units for type A, say, with units twice as large, it should have no effect on the result, taking into account those units. In this case, it means that if we multiply the numbers of the first row of the input by 0.5 then the numbers of the first column of the output should get multiplied by 2. In fact this is a basic property of matrix inverses. In other words, this mathematical property of matrix inverses is guaranteed by a type system that can handle units and heterogeneous matrices. It would be impossible to write a matrix inverter that type checks and fails to have this property. Unfortunately it's still possible to write a matrix inverter that type checks and is incorrect some other way. Nonetheless this kind of type system would put a very big constraint on the code and is likely to eliminate many sources of error.

An example, briefly sketched

I thought I'd look at an actual example of a matrix inverter to see what would happen if I used a type checker like the one I've described. I looked at the conjugate gradient method. At the Wikipedia page, note the line

$\alpha_k:=\frac{\mathbf{r}_k^T\mathbf{r}_{k+1}}{\mathbf{p}_k^T\mathbf{Ap}_k}$

This would immediately fail to type check because if $r$ is of generic vector type $R=(R_1,\ldots,R_n)$ then $R_1^2$ isn't the same type as $R_2^2$ so they can't be added. I won't go into any of the details but the easiest way to patch up this code to make it type check is to introduce a new matrix $P$ of type $R^{-1}\otimes R^{-1}$ and besides using it to make this inner product work (replacing the numerator by $\mathbf{r}_k^T P\mathbf{r}_{k+1}$ ) we also use $P$ anywhere in the code we need to convert a vector of type $R$ to a vector of type $R^{-1}$ . If you try to do this as sparingly as possible you'll end up with a modified algorithm. But at first this seems weird. Why should this matrix inverse routine rely on someone passing in a second matrix to make it type check? And what is this new algorithm anyway? Well scroll down the Wikipedia page and you get to the preconditioned conjugate gradient algorithm. The extra matrix we need to pass in is the preconditioner. This second algorithm would type check. Preconditioned conjugate gradient, with a suitable preconditioner, generally performs better than pure conjugate gradient. So in this case we're getting slightly more than a check on our code's correctness. The type checker for our imaginary language would give a hint on how to make the code perform better. There's a reason for this. The original conjugate gradient algorithm is implicitly making a choice of units that sets scales along the axes. These determine the course taken by the algorithm. It's not at all clear that picking these scalings randomly (which is in effect what you're doing if you throw a random problem at the algorithm) is any good. It's better to pick a preconditioner adapted to the scale of the problem and the type checker is hinting (or would be if it existed) that you need to do this. Compare with the gradient descent algorithm whose scaling problems are better known.

But which language?

I guess both Agda and Idris could be made to implement what I've described. However, I've a hunch it might not be easy to use in practice.

tag:blogger.com,1999:blog-11295132.post-1593552387558031235

Extensions

Cofree meets Free

sigfpe May 24, 2014 Updated May 24, 2014

Show full content

> {-# LANGUAGE RankNTypes, MultiParamTypeClasses, TypeOperators #-}

Introduction

After I spoke at BayHac 2014 about free monads I was asked about cofree comonads. So this is intended as a sequel to that talk. Not only am I going to try to explain what cofree comonads are. I'm also going to point out a very close relationship between cofree comonads and free monads.

At the beginning of the talk the Google Hangout software seems to have switched to the laptop camera so you can't see the slides in the video. However the slides are here.

Cothings as machines

I often think of coalgebraic things as machines. They have some internal state and you can press buttons to change that internal state. For example here is a type class for a machine with two buttons that's related to a magma:

> class TwoButton a where
>   press :: a -> (a, a)

The idea is that the state of the machine is given by some type a and you could press either the left button or the right button. The result of pressing one or other button is given by these two functions:

> pressLeft, pressRight :: TwoButton a => a -> a
> pressLeft = fst . press
> pressRight = snd . press

(As with many metaphors used to explain Haskell type classes your mileage may vary. Sometimes you'll have to stretch your imagination to see what the set of buttons is for a particular cothing.)

Comonads

Just as monads are a kind of generalised algebraic structure (for example see my talk), comonads are a generalised kind of machine. The idea is that for any state of the machine there is a bunch of buttons we could press. But we don't have two buttons, or any fixed number of buttons. We instead have a functorful of buttons (if you think of functors by analogy with containers). We also don't get to directly see the internal state of the machine but instead we get to make observations.

Here's the type class:

> class Comonad w where
>   extract :: w a -> a
>   duplicate :: w a -> w (w a)

The state of the machine is given by w a. We observe the state using the extract function. And when we come to press a button, we have a functorful of new states that it could end up in. The duplicate function gives the container of those new states.

For example, various kinds of zipper give rise to comonads. Zippers allow you to "focus" on a part of a data structure. The extract operation allows you to observe the point that currently has focus. There is one button for every position in the structure where the focus could be. Pressing the corresponding button moves the focus to that point. Similarly the Store comonad has one button for each value you can store in the field it represents. Press the button and the value gets stored in the field.

Cofreeness as a way to memoise

Cofree coalgebras can be thought of as memoised forms of elements of coalgebras. For example, the TwoButton machine above has a function, press, as part of its definition. Memoising an element of such a thing means tabulating everything that could possibly happen if you pressed the buttons so we no longer need the press function. One approach is to try something like this:

data CofreeTwoButton = Memo CofreeTwoButton CofreeTwoButton

The structure contains two CofreeTwoButtons, each giving the result of pressing one of the two buttons. Any element of CofreeTwoButton may now be memoised like so:

memoiseTwoButton :: TwoButton m => m -> CofreeTwoButton
memoiseTwoButton m = Memo (memoiseTwoButton (pressLeft m)) (memoiseTwoButton (pressRight m))

It definitely tabulates the result of pressing buttons. But it has a major flaw. We have no way of seeing what's stored in the table! To make this useful we want to also store some data in the table that we can peek at. So here is a better definition:

> data CofreeTwoButton a = Memo a (CofreeTwoButton a) (CofreeTwoButton a)
> memoiseTwoButton :: TwoButton m => (m -> a) -> m -> CofreeTwoButton a
> memoiseTwoButton f m = Memo (f m) (memoiseTwoButton f (pressLeft m)) (memoiseTwoButton f (pressRight m))

The first argument to memoiseTwoButton says what we want to store in the table and then memoiseTwoButton goes ahead and stores it. We can use the identity function if we want to store the original elements.

Note how this is like foldMap:

foldMap :: Monoid m => (a -> m) -> t a -> m

if we replace t by the list functor and remember that lists are free monoids. The main difference is that arrows have been reversed. Where foldMap takes an element of a free monoid and interprets it as an element of another monoid, memoiseTwoButton packs an element of a TwoButton into a cofree structure. The "interpretation" and "packing" here are both homomorphisms for their respective structures. Homomorphisms respect equations so if an equation holds between elements of a free monoid we expect it to also hold when interpreted in another monoid. But any element of a free monoid can be interpreted in any other monoid meaning that any equation that holds between elements of a free monoid must hold in any monoid. That's why free monoids are designed so that the only equations that hold between elements are those that follow from the monoid laws.

With the TwoButton we have a dualised version of the above. Every element of every TwoButton can be packed into the CofreeTwoButton. So every equation in the original structure will still hold after the packing. So every equation that holds in some TwoButton must have some solution in CofreeTwoButton. That gives an idea of what a CofreeTwoButton is by analogy with the free monoid.

Cofree comonads

A cofree comonad is basically a memoised comonad. So the data structure is:

> data Cofree f a = Cofree a (f (Cofree f a))

At each point in the "table" we store some observable value of type a. And we have a functorful of buttons, so we expect to have a functorful of new states we could transition to. The Functor instance looks like:

> instance Functor f => Functor (Cofree f) where
>   fmap f (Cofree a fs) = Cofree (f a) (fmap (fmap f) fs)

We apply f to the observable value and then push the fmap f down to the child nodes.

The duplicate function takes a memoised state and replaces the observable stored at each position with the memoised state that gives rise to the observable.

> instance Functor f => Comonad (Cofree f) where
>   extract (Cofree a _) = a
>   duplicate c@(Cofree _ fs) = Cofree c (fmap duplicate fs)

Now by analogy with memoiseTwoButton we can memoise comonads.

> memoiseComonad :: (Comonad w, Functor f) =>
>                   (forall x.w x -> f x) -> (forall x.w x -> Cofree f x)
> memoiseComonad f w = Cofree (extract w) (fmap (memoiseComonad f) (f (duplicate w)))

So that's what a cofree comonad is: it's a type that can be used to memoise all of the states that are accessible from a state in a comonad by pressing its buttons.

Cofree comonad meets free monad

But that's not all. There is a close relationship between cofree comonads and free monads. So to get going, here's a free monad type:

> data Free f a = Id a | Free (f (Free f a))


> join' :: Functor f => Free f (Free f a) -> Free f a
> join' (Id x) = x
> join' (Free fa) = Free (fmap join' fa)


> instance Functor f => Functor (Free f) where
>   fmap f (Id x) = Id (f x)
>   fmap f (Free fa) = Free (fmap (fmap f) fa)


> instance Functor f => Monad (Free f) where
>   return = Id
>   m >>= f = join' (fmap f m)

Now I'll define a kind of pairing between functors. Given a way to combine two kinds of element, the pairing gives a way to combine a pair of containers of those elements.

> class (Functor f, Functor g) => Pairing f g where
>   pair :: (a -> b -> r) -> f a -> g b -> r


> data Identity a = Identity a
> instance Functor Identity where
>   fmap f (Identity x) = Identity (f x)


> instance Pairing Identity Identity where
>   pair f (Identity a) (Identity b) = f a b


> data (f :+: g) x = LeftF (f x) | RightF (g x)
> instance (Functor f, Functor g) => Functor (f :+: g) where
>   fmap f (LeftF x) = LeftF (fmap f x)
>   fmap f (RightF x) = RightF (fmap f x)


> data (f :*: g) x = f x :*: g x
> instance (Functor f, Functor g) => Functor (f :*: g) where
>   fmap f (x :*: y) = fmap f x :*: fmap f y


> instance (Pairing f f', Pairing g g') => Pairing (f :+: g) (f' :*: g') where
>   pair p (LeftF x) (a :*: _) = pair p x a
>   pair p (RightF x) (_ :*: b) = pair p x b


> instance (Pairing f f', Pairing g g') => Pairing (f :*: g) (f' :+: g') where
>   pair p (a :*: _) (LeftF x) = pair p a x
>   pair p (_ :*: b) (RightF x) = pair p b x


> instance Pairing ((->) a) ((,) a) where
>   pair p f = uncurry (p . f)

Given a pairing between f and g we get one between Cofree f and Free g.

> instance Pairing f g => Pairing (Cofree f) (Free g) where
>   pair p (Cofree a _) (Id x) = p a x
>   pair p (Cofree _ fs) (Free gs) = pair (pair p) fs gs

An element of Free g can be thought of as an expression written in a DSL. So this pairing gives a way to apply a monadic expression to a memoised comonad. In other words, if you think of comonads as machines, monads give a language that can be used to compute something based on the output of the machine.

Here's an almost trivial example just so you can see everything working together. A reasonable definition of a comagma structure on the type a is a -> UpDown a with UpDown defined as:

> data UpDown a = Up a | Down a


> instance Functor UpDown where
>   fmap f (Up a) = Up (f a)
>   fmap f (Down a) = Down (f a)


> type CofreeComagma a = Cofree UpDown a

A well known comagma structure on the positive integers is given by the famous Collatz conjecture:

> collatz :: Integer -> UpDown Integer
> collatz n = if even n then Down (n `div` 2) else Up (3*n+1)

We can memoise this as a cofree comonad:

> memoisedCollatz :: Integer -> CofreeComagma Integer
> memoisedCollatz n = Cofree n (fmap memoisedCollatz (collatz n))

Here's a picture of memoisedCollatz 12:

Now let's make the dual functor in readiness for building the dual monad:

> data Two a = Two a a
> instance Functor Two where
>   fmap f (Two a b) = Two (f a) (f b)

And here we set up a pairing:

> instance Pairing UpDown Two where
>   pair f (Up a) (Two b _) = f a b
>   pair f (Down a) (Two _ c) = f a c


> execute :: Cofree UpDown x -> Free Two (x -> r) -> r
> execute w m = pair (flip ($)) w m

This gives rise to a free monad isomorphic to the one in my talk:

> data Direction = WentUp | WentDown deriving Show


> choose :: Free Two Direction
> choose = Free (Two (return WentUp) (return WentDown))

And here's an example of some code written in the corresponding DSL:

> ex1 :: Free Two (Integer -> String)
> ex1 = do
>   x <- choose
>   y <- choose
>   case (x, y) of
>       (WentDown, WentDown) -> return (\z -> "Decreased twice " ++ show z)
>       _ -> return show

It can be represented as:

And here's what happens when they meet:

> go1 :: String
> go1 = execute (memoisedCollatz 12) ex1

This can be understood through the combined picture:

References

On getting monads from comonads more generally see Monads from Comonads. For more on memoising and how it's really all about the Yoneda lemma see Memoizing Polymorphic Functions. I'm waiting for Tom Leinster to publish some related work. The pairing above gives a way for elements of free monads to pick out elements of cofree comonads and is a special case of what I'm talking about here. But I think Tom has some unpublished work that goes further.

If you think of a comonad as a compressed object that is decompressed by a monadic decision tree, then you'd expect some form of information theoretical description to apply. That makes me think of Convex spaces and an operadic approach to entropy.

tag:blogger.com,1999:blog-11295132.post-1797157221719257594

Extensions

Types, and two approaches to problem solving

sigfpe May 17, 2014 Updated May 17, 2014

Show full content

Introduction There are two broad approaches to problem solving that I see frequently in mathematics and computing. One is attacking a problem via subproblems, and another is attacking a problem via quotient problems. The former is well known though I’ll give some examples to make things clear. The latter can be harder to recognise but there is one example that just about everyone has known since infancy.

Subproblems Consider sorting algorithms. A large class of sorting algorithms, including quicksort, break a sequence of values into two pieces. The two pieces are smaller so they are easier to sort. We sort those pieces and then combine them, using some kind of merge operation, to give an ordered version of the original sequence. Breaking things down into subproblems is ubiquitous and is useful far outside of mathematics and computing: in cooking, in finding our path from A to B, in learning the contents of a book. So I don’t need to say much more here.

Quotient problems The term quotient is a technical term from mathematics. But I want to use the term loosely to mean something like this: a quotient problem is what a problem looks like if you wear a certain kind of filter over your eyes. The filter hides some aspect of the problem that simplifies it. You solve the simplified problem and then take off the filter. You now ‘lift’ the solution of the simplified problem to a solution to the full problem. The catch is that your filter needs to match your problem so I’ll start by giving an example where the filter doesn’t work.

Suppose we want to add a list of integers, say: 123, 423, 934, 114. We can try simplifying this problem by wearing a filter that makes numbers fuzzy so we can’t distinguish numbers that differ by less than 10. When we wear this filter 123 looks like 120, 423 looks like 420, 934 looks like 930 and 114 looks like 110. So we can try adding 120+420+930+110. This is a simplified problem and in fact this is a common technique to get approximate answers via mental arithmetic. We get 1580. We might hope that when wearing our filters, 1580 looks like the correct answer. But it doesn’t. The correct answer is 1594. This filter doesn’t respect addition in the sense that if a looks like a’ and b looks like b’ it doesn’t follow that a+b looks like a’+b’.

To solve a problem via quotient problems we usually need to find a filter that does respect the original problem. So let’s wear a different filter that allows us just to see the last digit of a number. Our original problem now looks like summing the list 3, 3, 4, 4. We get 4. This is the correct last digit. If we now try a filter that allows us to see just the last two digits we see that summing 23, 23, 34, 14 does in fact give the correct last two digits. This is why the standard elementary school algorithms for addition and multiplication work through the digits from right to left: at each stage we’re solving a quotient problem but the filter only respects the original problem if it allows us to see the digits to the right of some point, not digits to the left. This filter does respect addition in the sense that if a looks like a’ and b looks like b’ then a+b looks like a’+b’.

Another example of the quotient approach is to look at the knight’s tour problem in the case where two opposite corners have been removed from the chessboard. A knight’s tour is a sequence of knight’s moves that visit each square on a board exactly once. If we remove opposite corners of the chessboard, there is no knight’s tour of the remaining 62 squares. How can we prove this? If you don’t see the trick you can get get caught up in all kinds of complicated reasoning. So now put on a filter that removes your ability to see the spatial relationships between the squares so you can only see the colours of the squares. This respects the original problem in the sense that a knight’s move goes from a black square to a white square, or from a white square to a black square. The filter doesn’t stop us seeing this. But now it’s easier to see that there are two more squares of one colour than the other and so no knight’s tour is possible. We didn’t need to be able to see the spatial relationships at all.

(Note that this is the same trick as we use for arithmetic, though it’s not immediately obvious. If we think of the spatial position of a square as being given by a pair of integers (x, y), then the colour is given by x+y modulo 2. In other words, by the last digit of x+y written in binary. So it’s just the see-only-digits-on-the-right filter at work again.)

Wearing filters while programming So now think about developing some code in a dynamic language like Python. Suppose we execute the line:

a = 1

The Python interpreter doesn’t just store the integer 1 somewhere in memory. It also stores a tag indicating that the data is to be interpreted as an integer. When you come to execute the line:

b = a+1

it will first examine the tag in a indicating its type, in this case int, and use that to determine what the type for b should be.

Now suppose we wear a filter that allows us to see the tag indicating the type of some data, but not the data itself. Can we still reason about what our program does?

In many cases we can. For example we can, in principle, deduce the type of

a+b*(c+1)/(2+d)

if we know the types of a, b, c, d. (As I’ve said once before, it’s hard to make any reliable statement about a bit of Python code so let's suppose that a, b, c and d are all either of type int or type float.) We can read and understand quite a bit of Python code wearing this filter. But it’s easy to go wrong. For example consider

if a>1 then: return 1.0 else: return 1

The type of the result depends on the value of the variable a. So if we’re wearing the filter that hides the data, then we can’t predict what this snippet of code does. When we run it, it might return an int sometimes and a float other times, and we won’t be able to see what made the difference.

In a statically typed language you can predict the type of an expression knowing the type of its parts. This means you can reason reliably about code while wearing the hide-the-value filter. This means that almost any programming problem can be split into two parts: a quotient problem where you forget about the values, and then problem of lifting a solution to the quotient problem to a solution to the full problem. Or to put that in more conventional language: designing your data and function types, and then implementing the code that fits those types.

I chose to make the contrast between dynamic and static languages just to make the ideas clear but actually you can happily use similar reasoning for both types of language. Compilers for statically typed languages, give you a lot of assistance if you choose to solve your programming problems this way.

A good example of this at work is given in Haskell. If you're writing a compiler, say, you might want to represent a piece of code as an abstract syntax tree, and implement algorithms that recurse through the tree. In Haskell the type system is strong enough that once you’ve defined the tree type the form of the recursion algorithms is often more or less given. In fact, it can be tricky to implement tree recursion incorrectly and have the code compile without errors. Solving the quotient problem of getting the types right gets you much of the way towards solving the full problem.

And that’s my main point: types aren’t simply a restriction mechanism to help you avoid making mistakes. Instead they are a way to reduce some complex programming problems to simpler ones. But the simpler problem isn’t a subproblem, it’s a quotient problem. Dependent types Dependently typed languages give you even more flexibility with what filters you wear. They allow you to mix up values and types. For example both C++ and Agda (to pick an unlikely pair) allow you to wear filters that hide the values of elements in your arrays while allowing you to see the length of your arrays. This makes it easier to concentrate on some aspects of your problem while completely ignoring others.

Notes I wrote the first draft of this a couple of years ago but never published it. I was motivated to post by a discussion kicked off by Voevodsky on the TYPES mailing list http://lists.seas.upenn.edu/pipermail/types-list/2014/001745.html

This article isn’t a piece of rigorous mathematics and I’m using mathematical terms as analogies.

The notion of a subproblem isn’t completely distinct from a quotient problem. Some problems are both, and in fact some problems can be solved by transforming them so they become both.
More generally, looking at computer programs through different filters is one approach to abstract interpretation http://en.wikipedia.org/wiki/Abstract_interpretation. The intuition section there (http://en.wikipedia.org/wiki/Abstract_interpretation#Intuition) has much in common with what I’m saying.

tag:blogger.com,1999:blog-11295132.post-4999062346864102325

Extensions

The Monad called Free

sigfpe Apr 26, 2014 Updated Apr 26, 2014

Show full content

Introduction

As Dan Doel points out here, the gadget Free that turns a functor into a monad is itself a kind of monad, though not the usual kind of monad we find in Haskell. I'll call it a higher order monad and you can find a type class corresponding to this in various places including an old version of Ed Kmett's category-extras. I'll borrow some code from there. I hunted around and couldn't find an implementation of Free as an instance of this class so I thought I'd plug the gap.

> {-# LANGUAGE RankNTypes, FlexibleContexts, InstanceSigs, ScopedTypeVariables #-}


> import Control.Monad
> import Data.Monoid

To make things unambiguous I'll implement free monads in the usual way here:

> data Free f a = Pure a | Free (f (Free f a))


> instance Functor f => Functor (Free f) where
>     fmap f (Pure a) = Pure (f a)
>     fmap f (Free a) = Free (fmap (fmap f) a)


> instance Functor f => Monad (Free f) where
>     return = Pure
>     Pure a >>= f = f a
>     Free a >>= f = Free (fmap (>>= f) a)

The usual Haskell typeclass Monad corresponds to monads in the category of types and functions, Hask. We're going to want monads in the category of endomorphisms of Hask which I'll call Endo.

The objects in Endo correspond to Haskell's Functor. The arrows in Endo are the natural transformations between these functors:

> type Natural f g = (Functor f, Functor g) => forall a. f a -> g a

So now we are led to consider functors in Endo.

> class HFunctor f where

A functor in Endo must map functors in Hask to functors in Hask. So if f is a functor in Endo and g is a functor in Hask, then f g must be another functor in Hask. So there must be an fmap associated with this new functor. There's an associated fmap for every g and we collect them all into one big happy natural family:

>     ffmap :: Functor g => (a -> b) -> f g a -> f g b

But note also that by virtue of being a functor itself, f must have its own fmap type function associated with it. The arrows in Endo are natural transformations in Hask so the fmap for HFunctor must take arrows in Endo to arrows in Endo like so:

>     hfmap :: (Functor g, Functor h) => Natural g h -> Natural (f g) (f h)

Many constructions in the category Hask carry over to Endo. In Hask we can form a product of type types a and b as (a, b). In Endo we form the product of two functors f and g as

> data Product f g a = Product (f (g a))

Note that this product isn't commutative. We don't necessarily have an isomorphism from Product f g to Product g f. (This breaks many attempts to transfer constructions from Hask to Endo.) We also won't explicitly use Product because we can simply use the usual Haskell composition of functors inline.

We can implement some functions that act on product types in both senses of the word "product":

> left :: (a -> c) -> (a, b) -> (c, b)
> left f (a, b) = (f a, b)


> right :: (b -> c) -> (a, b) -> (a, c)
> right f (a, b) = (a, f b)


> hleft :: (Functor a, Functor b, Functor c) => Natural a c -> a (b x) -> c (b x)
> hleft f = f


> hright :: (Functor a, Functor b, Functor c) => Natural b c -> a (b x) -> a (c x)
> hright f = fmap f

(Compare with what I wrote here.)

We have something in Endo a bit like the type with one element in Hask, namely the identity functor. The product of a type a with the one element type in Hask gives you something isomorphic to a. In Endo the product is composition for which the identity functor is the identity. (Two different meanings of the word "identity" there.)

We also have sums. For example, if we define a functor like so

> data F a = A a | B a a

we can think of F as a sum of two functors: one with a single constructor A and another with constructor B.

We can now think about reproducing an Endo flavoured version of lists. The usual definition is isomorphic to:

> data List a = Nil | Cons a (List a)

And it has a Monoid instance:

> instance Monoid (List a) where
>   mempty = Nil
>   mappend Nil as = as
>   mappend (Cons a as) bs = Cons a (mappend as bs)

We can try to translate that into Endo. The Nil part can be thought of as being an element of a type with one element so it should become the identity functor. The Cons a (List a) part is a product of a and List a so that should get replaced by a composition. So we expect to see something vaguely like:

List' a = Nil' | Cons' (a (List' a))

That's not quite right because List' a is a functor, not a type, and so acts on types. So a better definition would be:

List' a b = Nil' b | Cons' (a (List' a b))

That's just the definition of Free. So free monads are lists in Endo. As everyone knows :-) monads are just monoids in the category of endofunctors. Free monads are also just free monoids in the category of endofunctors.

So now we can expect many constructions associated with monoids and lists to carry over to monads and free monads.

An obvious one is the generalization of the singleton map a -> List a:

> singleton :: a -> List a
> singleton a = Cons a Nil


> hsingleton :: Natural f (Free f)
> hsingleton f = Free (fmap Pure f)

Another is the generalization of foldMap. This can be found under a variety of names in the various free monad libraries out there but this implementation is designed to highlight the similarity between monoids and monads:

> foldMap :: Monoid m => (a -> m) -> List a -> m
> foldMap _ Nil = mempty
> foldMap f (Cons a as) = uncurry mappend $ left f $ right (foldMap f) (a, as)


> fold :: Monoid m => List m -> m
> fold = foldMap id


> hFoldMap :: (Functor f, Functor m, Monad m) => Natural f m -> Natural (Free f) m
> hFoldMap _ (Pure x) = return x
> hFoldMap f (Free x) = join $ hleft f $ hright (hFoldMap f) x


> hFold :: Monad f => Natural (Free f) f
> hFold = hFoldMap id

The similarity here isn't simply formal. If you think of a list as a sequence of instructions then foldMap interprets the sequence of instructions like a computer program. Similarly hFoldMap can be used to interpret programs for which the free monad provides an abstract syntax tree.

You'll find some of these functions here by different names.

Now we can consider Free. It's easy to show this is a HFunctor by copying a suitable definition for List:

> instance Functor List where
>   fmap f = foldMap (singleton . f)


> instance HFunctor Free where
>     ffmap = fmap
>     hfmap f = hFoldMap (hsingleton . f)

We can define HMonad as follows:

> class HMonad m where
>     hreturn :: Functor f => f a -> m f a
>     hbind :: (Functor f, Functor g) => m f a -> Natural f (m g) -> m g a

Before making Free an instance, let's look at how we'd make List an instance of Monad

> instance Monad List where
>     return = singleton
>     m >>= f = fold (fmap f m)

And now the instance I promised at the beginning.

> instance HMonad Free where
>     hreturn = hsingleton
>     hbind m f = hFold (hfmap f m)

I've skipped the proofs that the monad laws hold and that hreturn and hbind are actually natural transformations in Endo. Maybe I'll leave those as exercises for the reader.

Update

After writing this I tried googling for "instance HMonad Free" and I found this by haasn. There's some other good stuff in there too.

tag:blogger.com,1999:blog-11295132.post-3193567045533409414

Extensions

Reinversion Revisited

sigfpe Feb 2, 2014 Updated Feb 2, 2014

Show full content

Introduction

A while back I talked about the idea of reinversion of control using the continuation monad to wrest control back from an interface that only wants to call you, but doesn't want you to call them back. I want to return to that problem with a slightly different solution. The idea is that we build an interpreter for an imperative language that's an embedded Haskell DSL. You arrange that the DSL does the work of waiting to be called by the interface, but from the point of view of the user of the DSL it looks like you're calling the shots. To do this I'm going to pull together a bunch of techniques I've talked about before. This approach is largely an application of what apfelmus described here.

The code

We'll start with some administrative stuff before getting down to the real code:

> {-# LANGUAGE TemplateHaskell #-}


> import Control.Lens
> import Control.Monad
> import Control.Monad.Loops

We'll make our DSL an imperative wrapper around Gloss:

> import Graphics.Gloss.Interface.Pure.Game

We'll define a structure that can be used to represent the abstract syntax tree (AST) of our DSL. Our DSL will support the reading of inputs, adding pictures to the current picture, and clearing the screen.

First we'll need a wrapper that allows us to represent ordinary Haskell values in our DSL:

> data Basic a = Return a

Now we want an expression that represents events given to us by Gloss. Internally we'll represent this by a function that says what our program does if it's given an event. It says what our program does by returning another AST saying what happens when the input is received. (I've previously talked about these kinds of expression trees here).

>              | Input (Event -> Basic a)

We have a command to render some graphics. It appends a new Picture to the current picture. Again, part of the AST muct be another AST saying what happens after the picture is rendered:

>              | Render Picture (Basic a)

And lastly here's the AST for a clear screen command:

>              | Cls (Basic a)

Our AST will form a monad. This will allow us to build ASTs using ordinary Haskell do-notation. This technique is what I described previously here.

> instance Monad Basic where
>     return = Return
>     Return a >>= f = f a
>     Input handler >>= f = Input (\e -> handler e >>= f)
>     Render p a >>= f = Render p (a >>= f)
>     Cls a >>= f = Cls (a >>= f)

You can think of the expression x >>= f as x with the tree f a grafted in to replace any occurrence of Return a in it. This is exactly what Return a >>= f does. But applying >>= f to the other ASTs simply digs down "inside" the ASTs to find other occurrences of Return a.

It's convenient to uses lenses to view Gloss's game world:

> data World = World { _program :: Basic (), _picture :: Picture }
> $(makeLenses ''World)

And now we have some wrappers around the interpreter's commands. The return () provides the convenient place where we can graft subtrees into our AST.

> input = Input return
> render p = Render p (return ())
> cls = Cls (return ())

Now we can start coding. Here's a test to see if a Gloss event is a key down event:

> keydown (EventKey (Char key)            Down _ _) = True
> keydown (EventKey (SpecialKey KeySpace) Down _ _) = True
> keydown _ = False

And now here's a complete program using our DSL. It's deliberately very imperative. It simply iterates over a nested pair of loops, collecting keystrokes and displaying them. It reads a lot like an ordinary program written in a language like Python or Basic:

> mainProgram = do
>     render (Color white $ Scale 0.2 0.2 $ Text "Type some text")


>     forM_ [780, 760..] $ \ypos -> do
>         forM_ [0, 20..980] $ \xpos -> do


>             event <- iterateUntil keydown $ input


>             let key = case event of
>                         EventKey (Char key)            Down _ _ -> key
>                         EventKey (SpecialKey KeySpace) Down _ _ -> ' '


>             when (ypos == 780 && xpos == 0) $ cls
>             render $ Color white $ Translate (xpos-500) (ypos-400) $ Scale 0.2 0.2 $ Text $ [key]

Here is where we launch everything, placing our program and starting Blank picture into the World.

> main = play (InWindow "Basic" (1000, 800) (10, 10))
>  black 
>  60
>  (World mainProgram Blank)
>  (^. picture)
>  handleEvent
>  (const id)

So now we need just one more ingredient, an actual interpreter for our AST. It's the event handler:

> handleEvent :: Event -> World -> World

The Return command is purely a place to graft in subtrees. It should never be interpreted.

> handleEvent _ (World (Return a) _) = error "error!"

After receiving some input, I want the interpreter to keep interpreting commands such as Cls that don't need any more input. I'm going to do this by using a null event EventMotion (0,0). But when an input really is desired, I want this null event to be ignored.

> handleEvent (EventMotion (0, 0)) state@(World (Input handler) _) = state

We render something by mappending it to the current picture stored in the World. But the rendering is carried out by the event handler. We update the state so that at the next event, the subtree of the AST is executed. This means that after updating the picture, the event still needs to be handed back to the event handler:

> handleEvent event state@(World (Render p cont) _) = state & (picture <>~ p)    & (program .~ cont) & handleEvent event

Clearing the screen is similar:

> handleEvent event state@(World (Cls cont)      _) = state & (picture .~ Blank) & (program .~ cont) & handleEvent event

And now we need to handle inputs. We do this by applying the "what happens when the input is received" function to the event. The result is put back in the state indicating that this is what we want to happen at the next event. So the interpreter doesn't stop here, waiting for the next event, the interpreter sends itself a null event.

> handleEvent event state@(World (Input handler) _) = state & (program .~ handler event) & handleEvent (EventMotion (0, 0))

And that's it!

There are many changes that can be made. We can easily add more commands and make the state more complex. But you might also notice that we create the AST only to tear it apart again in the interpreter. We can actually elide the AST creation, but that will eventually bring us back to something like what I originally posted. This shouldn't be a big surprise, I've already shown how any monad can be replaced with the continuation monad here. By the way, it's pretty easy to add a Fork command. You can replace the _program :: Basic() field with _program :: [Basic ()] and interpret this as a list of threads using a scheduler of your choice.

Acknowledgements

I was prompted to write this (a little late, I know) after reading this article and Tekmo's post on reddit. I think ultimately continuations may perform better than using ASTs. But sometimes it's nice to build an AST because they give you an object that can easily be reasoned about and manipulated by code. Much as I love trickery with continuations, I find ASTs are much easier to think about.

Postscript

My real motivation was that I was thinking about games. The rules of games are often given in imperative style: first player 1 does this. Then they do this. If this happens they do that. And then it's player two's turn. I wanted my Haskell code to reflect that style.

Update

Added 'null' event to keep interpreter going when it makes sense to do so, but there's no event pending.

tag:blogger.com,1999:blog-11295132.post-5715474259100996105

Extensions

Distributed computing with alien technology

sigfpe Oct 26, 2013 Updated Oct 30, 2013

Show full content

Introduction

Suppose we are given a function $f$ of $2n$ boolean arguments that returns a boolean result. Alice has $n$ bits, $\{x_1, x_2, \dots, x_n\}$ and Bob has another $n$ bits $\{y_1, y_2, \ldots, y_n\}$ . Alice and Bob are widely separated and don't know each other's bits. What is the total number of bits that Alice has to send to Bob and that Bob has to send to Alice so that between them they can compute $f(x_1, \ldots, x_n, y_1, \ldots, y_n)$ ? Think about how complex $f$ might get. The $x_i$ and $y_i$ might each describe half of a "voxelised" region of space and $f$ might answer a question about a computational fluid dynamics (CFD) simulation running in that space. CFD simulations can be chaotic and so we might expect that in the worst case many bits have to be transferred back and forth between Alice and Bob. In the worst case we might expect that Alice has to send Bob all of her bits, or vice versa.

But in fact Alice needs to send Bob just one bit.

A loophole

To get the communication requirements down to one bit we need to use a loophole. But I hope to (1) justify the cheat to some extent and (2) justify that it's even worthwhile to think about cheats.

Alice and Bob have access to some Ancient technology. They each have one of a pair of boxes. At prearranged times, Alice puts a bit into her box, and Bob puts a bit into his box. A bit pops back out of Alice's box and a bit pops back out of Bob's box. Whatever the input, both Alice and Box have a 0.5 chance of seeing a one or zero pop out of their respective boxes. But when the two outputs are XORed together the result is the logical AND of the two inputs. With such boxes, Alice can compute $f$ after Bob sends a single bit down a conventional communication channel.

"But this is a total cheat!" you complain before I even start to explain their technique. It seems Alice receives a bit that depends on what Bob input, and so Bob is communicating with Alice. But look closely and you'll see that the boxes don't allow any communication. No matter what Bob inputs, Alice has a 0.5 chance of getting zero or one. There is no way Bob can use this to communicate anything. It's like intercepting a message encrypted with a one time pad. Without the pad, the message is basically a sequence of random bits. Nonetheless, it is true that the outputs that Alice and Bob see are correlated.

I hope I've convinced you that Alice and Bob can't send any bits with these boxes. Despite this, it is pretty clear that the behaviour of the boxes is non-local. We'll call any kind of boxes that allow instantaneous long range correlations that can't be explained by purely local behaviour non-local boxes. Boxes that can't be used for message sending are called non-signalling local boxes. And the particular non-local box I describe above is called a PR box (eg. see here).

(BTW As an aside note that as the box results in widely separated outputs that are correlated, but doesn't allow communication, it's an example of how non-locality doesn't imply communication. Usually when people want to give examples of such a thing they talk about quantum mechanics. But there's no need to mention quantum mechanics to explain the behaviour of these particular non-local boxes.)

The method

Any single bit boolean function of a finite sequence of bits can be written as a polynomial modulo 2. Each monomial in the polynomial can be written as a product of terms involing just the $x_i$ and terms involving just the $y_i$ , ie.

$f(x_1, \ldots, x_n, y_1, \ldots, y_n) = \sum_{j}A_jB_j$

where $A_j$ depends only on the $x_i$ , $B_j$ depends only on the $y_i$ and $j$ is drawn from some finite set. Alice can compute the $A_j$ and Bob can compute the $B_j$ . Now Alice and Bob, in parallel, feed $A_j$ and $B_j$ respectively into their PR boxes. We know that we could evaluate each term in the sum we want by adding Alice's output to Bob's output. But that would require sending one one-bit message for each $j$ . But we don't need each term one by one; we just want the sum. So Alice and Bob can individually sum their separate outputs knowing that adding Alice's output and Bob's output modulo 2 will be the correct sum. So Bob sends his sum to Alice. Alice adds that number to her own (modulo 2) and that's the value we want. Only one one-bit message was sent.

But what about reality?

Non-local boxes don't exist, do they? So why are we talking about them?

Actually, non-local boxes exist both theoretically and in the lab. Non-local correlations in quantum mechanics allow them to be constructed. But for this article I wanted to abstract from quantum mechanics and talk about the behaviour of a non-local box without getting my hands dirty with the details of quantum mechanics. Having said that, although non-local boxes do exist, the special case of the PR box can't in fact be constructed with quantum mechanics. In some sense it allows correlations that are "too strong". An article I wrote a while back describes the closest you can get to building a PR box with quantum correlations. Curiously, if you restrict yourself to the kind of non-local box quantum mechanics allows you to build you find that some functions can still be computed with less communication than you'd need if non-local correlations are disallowed. Nonetheless, the worst case scenario with QM still requires the sending of $n$ bits.

Going further there's an interesting conjecture. It says that any non-local box that is even marginally better (in some sense) than what quantum mechanics allows is powerful enough to allow the computation of any $f$ with only a single bit of communication. It suggests that quantum mechanics is right at the edge of the space of possible physics that make life difficult for us. If quantum mechanics were to be tweaked the tiniest amount to make correlations any stronger, large numbers of difficult distributed computing problems would suddenly collapse to become trivial. If the conjecture is true it means that nature looks a bit like a conspiracy to keep computer scientists in work. (It's possible the conjecture has been decided one way or the other by now.)

Final words

There are a couple of papers about universes where PR boxes can be built; so called boxworlds. There is a lot of interesting theoretical work in characterising quantum mechanics. In particular there are a number of theorems and conjectures that describe QM in the form "the most X theory that doesn't allow Y" where X is an interesting property and Y is something you'd like to do.

References

I learnt all of this from the paper Implausible Consequences of Superstrong Nonlocality by Wim van Dam.

tag:blogger.com,1999:blog-11295132.post-2860491179159614524

Extensions

What stops us defining Truth?

sigfpe Oct 12, 2013 Updated Oct 12, 2013

Show full content

Introduction
Recall the standard cartoon sketch of the proof of Gödel's first incompleteness theorem. We start by defining a predicate, $Prov$ , that is true if and only if its argument is provable. (Or more accurately, $Prov(g)$ is true if $g$ is the Gödel number of a provable proposition.) With some quining we can use this to construct the proposition $G$ which says $\neg Prov(G)$ . The proposition $G$ asserts its own unprovability.
Suppose instead we define a predicate $True$ which holds if its argument is true. We can use this to construct the proposition $H$ which says $\neg True(H)$ . Then if $H$ is true it must also be false and if it's false then it must be true. We seem to have a paradox. The loophole is that we assumed the existence of the predicate $True$ . So this argument demonstrates that there is actually no such predicate. This is Tarski's undefinability theorem.
But what exactly stops us defining $True$ ? What goes wrong if we attempt to define a predicate that analyses the parts of a proposition to tell us whether or not it is true?

Note
This article is written in English. But as is standard in much of mathematics, unless I state otherwise, I'm using English largely as shorthand for an argument that could, in principle, be written in the formal language of Set Theory. So I will allow myself to use all of the usual reasoning methods that are available in ZF, even when talking about other formal systems such as Peano Arithmetic.

Defining Truth for Propositional Calculus
Suppose we're given a proposition from propositional calculus like $A\wedge B$ . We can use a syntactic approach to determining whether or not it is true. We determine whether or not $A$ is true, then whether or not $B$ is true, and then the whole proposition is true if both $A$ and $B$ are true. Similarly $A\vee B$ is true if either $A$ or $B$ is true. Of course $A$ and $B$ might themselves be compound propositions using $\wedge$ , $\vee$ and $\neg$ . But that's fine, that simply means that to define truth for such propositions we need to employ recursion. In fact, we can straightforwardly turn such a definition into a recursive computer program.
(Ultimately with propositional calculus we hit the leaves which are atomic propositions like $P$ . Typically when we ask about the truth of a proposition in propositional calculus we've already made an assignment of truth values to the atomic propositions. So the base case for the recursion is straightforward.)
We can illustrate the process with a diagram:

The truth value of a node in the tree is determined by the truth of the propositions hanging underneath it. We have a parent-child relation between a proposition and its subexpressions. Recursion allows us to make a definition by defining what happens on the leaves of such a tree, and by saying how the definition at a node is built from that of its children.

Defining truth for Peano Arithmetic
We can go further and attempt this approach with Peano Arithmetic (PA). The catch is that we need to consider quantifiers. For example, consider this proposition from Peano arithmetic: $\forall x.\neg(Sx=0)$ . This proposition is true if and only if $\neg(Sx=0)$ is true whatever number we substitute for $x$ in the expression.

The proposition at the top of the tree above is true if all of the immediate children are true and their truth is in turn determined by the truth of the propositions immediately below them. With some work this eventually leads to a perfectly good definition of truth for propositions in PA. Because we have nodes with infinitely many children we don't get an algorithm guaranteed to terminate, but that's not a problem for a definition in ZF. Note that we don't literally prove the infinitely many child propositions one at a time. Instead what happens is that to define the truth of $\forall x.Q$ we define it in terms of the truth of some infinite family of propositions all based on $Q$ . ZF is perfectly good at dealing with such definitions without us having to list every element of our family explicitly.
Note how in this case the tree isn't the parse tree of the proposition. It's much bigger with nodes that have infinite branching. But that's fine, there's nothing about infinite branching that prevents us making a recursive definition. So we can ultimately extend the idea for defining truth in propositional calculus to include quanifiers and then all of Peano arithmetic.

Defining truth for ZF
But the approach used for PA looks like it might work perfectly well for ZF as well. For example, our definition of truth would say that $\forall x.P(x)$ is true if $P(x)$ is true whatever set we substitute for $x$ . In ZF there is no difficulty in defining a predicate that uses quantification over all sets. So it seems we can define $True$ for ZF in ZF, contradicting Tarski's theorem.

What went wrong?
Recursive definitions typically rely on the parent-child relation I mentioned above. To recursively define something we (1) define it for all leaves and then (2) specify how the definition at a parent is given in terms of the value for all of its children. We then invoke a recursion theorem of some sort to show how this uniquely defines our object for everything in our universe. For example, one form of recursion in Peano arithmetic has $0$ as its leaf and the only child of $S(n)$ is $n$ . The induction axiom for PA can be used to show that definitions using this parent-child relation are valid.
Similarly in ZF we have the empty set as leaf and the children of a set are simply its elements. But now we need to look closely at the recursion principle we need. For ZF we need to invoke the Transfinite Recursion Theorem. Transfinite recursion is very powerful. It's not just limited to induction over sets. It can also be used for induction over classes. For example if you need to recursively define a function on the class of all sets it can allow this. (Strictly speaking it'll be a function class rather than a function.) But now comes the catch. If you take a look at the Wikipedia article it mentions that the parent-child relation, $R$ , needs to be set-like (though as the article is currently written it's almost an afterthought). For this theorem to apply we need the collection of children of a proposition to form a set. But to prove the truth of a proposition with a quantifier at the front we need to prove something is true for all children where there is one child for each set. This means the children don't form a set. So we can't use transfinite recursion. And this means the informal definition of truth I gave above can't be turned into a rigorous definition.

Conclusion
I think this issue is quite subtle. It's really easy to say in English "this thing is true if that thing is true for all sets". Such a sentence in isolation can often be turned into a rigorous proposition in ZF. But if that sentence is part of a collection of sentences that refer to each other forming an attempt at a mutually recursive definition, you need to check precisely what parent-child relation you're using.

Acknowledgement
Thanks to Sridar Ramesh for making clear to me why the attempted definition of truth in ZF doesn't work. But I've probably made some mistakes above and they have nothing to do with Sridar.

tag:blogger.com,1999:blog-11295132.post-2325584291136500714

Extensions

https://blogger.com/feeds/11295132/posts/default

Posts