Show full content
Here’s a penny to a brown god – not that I know much about him – sullen, untameable, intractable, prone to numerical issues. Patient at times, and at first recognized as a frontier god. Useful, untrustworthy and a conveyor of speech. Forgotten to some extent, but omniscient, and always there.
His rhythm was present in the nursery bedroom,
In the rank ailanthus of the April dooryard,
In the smell of grapes on the autumn table,
And the evening circle in the winter gaslight.
Recently, I have been studying HMMs with a view to working out and implementing something called the CTC loss [1, 2, 3]- a conception that forms the core of many a modern speech recognition application, most recently reincarnating here [4, 5, 6] in streaming ASR. But tangentially, I found that it might be instructive to present how Langrange Multipliers work through two simple examples in GMMs and HMMs. Both of these are properly explained in Bishop 2006 [7]. The HMM machinery in general is an extension of the GMM setup, with state transitions.
GMM example – EMHere, we look at the general scenario where our distribution of interest is determined by latent variables
and is part of a multimodal distribution consisting of
mixture components (so our latent variables are one-hot encodings). We ‘pick’ a mixture component with probability
, and our data is the sum of these Gaussian components.


Upon inspection of the EM update above, we can see that we would like to take expectations over the . The old parameters
is used to compute this term, after which maximize the ensuing weighted likelihood terms with respect to the parameters
. Rewriting this a little differently, it can be seen that it boils down to obtaining the mixture component that contributes to the likelihood of the given point. Bishop calls them responsibilities.

We then maximize the likelihood (or, if you will, the quantity ) in the second step by taking derivatives with respect to the parameters
.
The maximization with respect to are fairly straightforward. Below, we show how it is done for
.
Take derivative of the log likelihood equation with respect to .



Now, in order to impose constraints on , we use the machinery of Lagrange multipliers. We note that the mixture components sum to unity.

Consider maximizing the quantity

Taking derivative with respect to gives

This gives, after multiplying by , summing and inserting responsibility terms:

This completes our demonstration for the GMM case.
HMM exampleIn this case, we have a sequential model consisting of timesteps, with a visible state
being observed which could arise from one of
hidden states at each timestep
. The latent variable
is a
dimensional vector with a single non-zero entry. At each timestep, the visible state being emitted is only a function of the hidden states. As for the hidden states, they are ‘Markovian’ in that they only depend on the previous timestep. Further, we also assume that the hidden distribution is the same at each timestep. The governing parameter
is thus ‘shared’ across all timesteps (this, I suppose is similar to the recurrent neural net formalism). Similarly, the emitting distribution is also assumed to be the same across all timesteps
. In Bishop, they use a continuous model – a Gaussian – but this could also be discrete. The starting state
is governed by a hyperprior
.

In order to do maximum likelihood estimation for the HMM, we again resort to the EM procedure to iteratively update the model parameters. The equation is involved, as before.

If we plug in the expression for the joint into the logarithm, we notice that it splits into three pieces. The second and third pieces need some unrolling. Notice that in the second piece, two variables and
are involved. When we take the expectation using the posterior
, there are
variables, of which all but
and
get marginalized. We therefore need to consider the joint
. In similar vein, the last product – corresponding to the visible states –
involves the term
to take expectations with, all other latent timesteps getting marginalized out. Putting it formally, the two influential terms in carrying out the expectation are:

Next, we rewrite the expectations making use of the fact that is binary valued, and that the expectation is simply the probability of it taking the value 1 (we also use the same machinery in GMMs).

We can now put it all together to write what the criterion looks like.

The constraints involve and
:


Upon differentiating, multiplying by and summing, we get


We can perform exactly similar set of operations for . This has an extra summation term over
.

This completes our treatment of Lagrange Multipliers for HMMs
ConclusionsNot to put too fine a point on it, this is an exposition of two examples of Lagrange Multipliers from the well known PRML book by Bishop. The reason I decided to put it up was that in many texts explaining the HMM (e.g. Duda [7] – also, quite recommended), they skim over how these (and similar) terms arise in the forward-backward algorithm. This was a cause of some confusion to me when I went over them. However, it is enormously instructive (though, perhaps a bit time consuming) to work it out from first principles.
References- Alex Graves et al.: Connectionist Temporal Classification https://www.cs.toronto.edu/~graves/icml_2006.pdf
- Hunnan et al.: Deep Speech – Scaling End to End Speech Recognition https://arxiv.org/abs/1412.5567
- Hunnan: Distill blog – https://distill.pub/2017/ctc/
- RNN-Transducer – googleai blog – https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html
- RNN-Transducer – https://arxiv.org/abs/1811.06621
- RNN-Transducer – Assembly AI blog – https://www.assemblyai.com/blog/an-overview-of-transducer-models-for-asr/
- Bishop 2006 – https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
- Duda, Hart and Stork – Pattern Classification, 2nd Edition – https://www.wiley.com/en-gb/Pattern+Classification%2C+2nd+Edition-p-9780471056690







































