Show full content
Wide-Band Voice Pulse Modeling[1] (also referred to as Wide-Band Harmonic Sinusoidal Modeling[2][5], presumably because it is also applicable to other monophonic harmonic sounds) is a hybrid time-frequency audio processing technique that models each time-domain period of the fundamental frequency in the frequency domain. This is done by calculating the times of minimally-phased impulses (in the case of the voice, these correspond to to glottal closure instants), and then extracting a window centered on said impulse with its size corresponding to the period of the fundamental frequency at that instant in time.
One advantage of this technique is that both the time and frequency resolution is better than what would happen using the traditional techniques. This is because, since the harmonics are all integer multiples of the fundamental frequency, they actually correspond exactly to the frequencies of each bin of the fourier transform (i.e. 0th bin [DC] = 0th harmonic [also DC], 1st bin = 1st harmonic, 2nd bin = 2nd harmonic) when the window size is the same as the period of the fundamental frequency. In theory, this would result in a perfect estimation of spectrum. In practice this is not completely true (due to varying fundamental frequency and amplitude, varying timbre, estimation error in the maximally flat phase onset detection, and the input signal being discrete sampled and containing non-harmonic frequencies), however the error is still much lower than other techniques.

Another advantage is since each window is centered on the minimal phase impulse time, it is also phase-locked and shape-invariant. It does not experience the phase drift that a pitch-synchronous analysis would because the windows, by definition, are always centered on the corresponding point in cycle of the fundamental frequency.
However the main disadvantage of this technique is that it models harmonics and only harmonics. However, the non-harmonic components are not lost, but rather becomes represented as rapidly varying noise in the pulse spectrum, according to [1]. I have also been able to notice this experimentally.

We often want to model non-harmonic components different. For example, we may want to model the stochastic residual as white noise filtered by a filter obtained from an approximation of its smoothed envelope[3] or from an estimation of the vocal tract formants[4]. Another we might want to process separately are subharmonics; for example, for the purpose of applying a growl-type effect[6]. Finally we may want to model transients differently, for example when applying time-scaling and pitch transposition effects[3].
In this post, I will describe several techniques I have to modify the WBVPM model to allow the handling of non-harmonic components separately, and also detail a method for processing the subharmonics. I am yet to test any of these techniques.
Residual SeparationBefore doing anything else specifically for subharmonics, we first need a method allows us to process the harmonics using WBVPM and the non-harmonics separately. I thought of a method for doing this some time ago, however I recently became aware that [5] uses the same overall idea as mine, although their approach is different and they stick to modeling both the harmonic and non-harmonic within the WBVPM model, whereas my method allows processing the non-harmonic separately.
The basic idea is that timbre evolves slowly over time, whereas the noise from non-harmonic components evolves very rapidly; actually at the frequency defined as reciprocal of the distance in time of the different pulses, so it is independent of pitch. In the frequency domain, this corresponds to timbre changes having their energy concentrated mainly at low frequencies, and decaying as the frequency increases (where frequency in this context actually corresponds to the time-evolution of voice pulses); and the noise having its energy concentrated at high frequencies. One unfortunate side-effect of any technique using this principle is that since low pitch sounds have pulses that are further apart, the frequency of the noise is lower, and the gap between the frequencies arising from timbre changes is now lower since those don’t change. This is especially unfortunate since other techniques used already perform worse for lower pitches[1], so this further compounds that.
Any implementation that uses this principle must also take into account the following:
- The voice pulse onsets are not necessarily spaced at the same exact interval. So just using an FFT with each point corresponding to step of one voice pulse to the next voice pulse would introduce error proportional to the deviancy of the voice pulse onset times from the ideal sequence where there are no deviations and the fundamental frequency is constant.
- The voice pulses may overlap or miss areas in the time-domain.
- To avoid aliasing, the number of harmonics for each voice pulse varies with the fundamental frequency. So, some harmonics may come in and out of existance.
- To improve the estimation of the harmonics, a border interpolation step is applied before applying the FFT for each voice pulse in WBVPM. However, this step only makes sense for harmonics and would presumably interfere with the value of non-harmonic components significantly
With that said, I have devised two approaches. However, both of those approaches share in common a method. This method consists of the following steps:
- We first solve the issue of varying pulse times by resampling (using a natural cubic in my implementation, although other techniques could be used instead) the values for the evolution of an individual harmonic over different voice pulses. The input positions to the interpolation are the times of the voice pulses, however the output positions are separated by a fixed interval, thus solving the first issue. An implementation of this may need to take into account aliasing artifacts.
- Then, we apply a fourier transform to the resampled harmonic values. We can then divide the spectrum into a high frequency region and low frequency. The high frequency corresponds to the residual non-harmonic components, and the low frequency (including the DC importantly) region corresponds to the harmonics. Importantly, the high frequency component actually corresponds to the residual as if it were a harmonic, and not the residual directly.
- We can then obtain the separated values by reversing the fourier transform and resampling process. Some amount of error may be introduced by the resampling (i.e. resampling and reversing the resampling would result in a different signal, even if no transform is applied in between). This counteracted by increasing the amount of sampling steps (thus at a short time delta) in the intermediate resampled representation. An improvement without needing to do this could also be done by calculating first the differential between the original evolution of the harmonic property over different pulses, and its values after resampling and reverse resampling, without the frequency domain transform being applied; and then adding this differential to this LF data afterwards.
Now the two approaches to it are as follows:
LF and HF approachIn this approach, we actually compute two different version of the voice pulse in WBVPM. For one version, we apply border interpolation; and for the other, we do not. We then apply the aforementioned method to both, and keep only the LF result from the version with border interpolation, and only the HF result from the one without border interpolation. This lets us overcome the fourth issue.
For the third issue, we can interpolate the closest harmonics for the missing harmonic, and apply a decay based on the distance to nearest non-missing harmonic. For the phase, we need to interpolate it in a way that such that it is continuous over both frequency and time. [7] proposes a method for accomplishing this. Another option would be interpolate the whole amplitude spectrum and then reconstruct an artificial phase spectrum from it using the technique described in [1]. I got this idea from [5]. However, this phase reconstruction technique is only applicable to maximally flat phased harmonics, so it could only be used for the LF border interpolated version.
Finally, we take the HF result and synthesize it using the WBVPM synthesis method, or a modified version of it, to obtain a time-domain representation of the residual which can then be further processed by other techniques.
This approach has several issues. One is that it does not handle issue #2. When we are dealing with harmonic pulses, those missing values can be reconstructed very well using a technique I proposed in this previous post; or just ignored entirely as in the original WBVPM method. Actually, these two techniques are actually equivalent for harmonics if the fundamental frequency remains stationary, and both it and the MFPA onsets are estimated perfectly. However, neither of this applies for non-harmonic components.
Another issue is that of phase. For the LF harmonic result, we can just apply the same method to the unwrapped over frequency and time phase (using the method from [1]) and keep only the LF. However, we need to do something different for HF non-harmonic components. One idea I came up with would to calculate LF and HF unwrapped phase results for the version of the pulse without border interpolation; then calculate the difference between the original phase and LF phase and divide by ratio of the difference between the original amplitude and LF amplitude to the original amplitude; and finally apply princarg to put the phase in the wrapped domain [-pi, +pi). However, I am not confident that this would work well.
LF and Residual approachAnother approach would be to calculate only the LF border-interpolated pulse, and then compute the residual by synthesizing it in the time-domain using the normal WBVPM synthesis approach that we use and then compute the difference between the input signal and this. This is then our time-domain residual. Issue #3 can be solved in the same way as the previous issue.
This has the advantage of solving issues #2 and #4 implicitly and also being simpler. Another seeming advantage is that the sum of the residual and harmonics is exactly equal to the input signal; however, this may actually be the only disadvantage of this approach. The reconstructed signal in WBVPM is slightly different from the original signal, and that difference is now included in the residual, which may result in undesired components.
Actually two variants of this technique exist. In one, we compute the LF of frequency and unwrapped phase separately before separating the residual. In the other, we calculate just the amplitude and reconstruct the phase from the amplitude using the phase-from-amplitude approach before separating the residual. The phase reconstruction is done implicitly after the residual separation for both approaches, so long as we are using the Excitation plus Resonance model. I am not certain which would be better, but I lean towards the former as I feel like the latter would introduce more unwanted artifacts into the residual.
Another possible approach would be to do the opposite – instead of constructing HF as the residual from LF, taking LF as the residual from the unborder-interpolated HF. I feel like this is very unlikely to be the best technique though.
DC ComponentIn theory, the DC component should remain constant between voice pulses. However, I have noticed that it in practice, it actually varies quite significantly between subsequent. This is presumably mainly due to subharmonics, and to a lesser degree, other non-harmonic components. We could perhaps in the first approach, transfer the whole DC component from the LF harmonic pulse to the HF non-harmonic pseudo-pulse. The equivalent for the second approach would be to just set the DC to zero in each pulse; or to its average over the whole audio signal. However, DC might also arise due to fundamental frequency estimation errors. I am not sure how effective the border interpolation method is for counteracting this. Perhaps an additional step would be needed for that specifically.
Residual HarmonicsIn [7], the authors noted that they noted that for the technique they were experimenting with, Narrow-Band Voice Pulse Modeling, which allowed directly the separation of harmonic of non-harmonic components, there were some traces of the harmonics present in the residual. They used a filter to attenuate the frequencies around the harmonics in the residual. I am not sure whether this would also be a significant issue for my WBVPM separation approach. However, if it is, we could presumably do something similar. We could also add the energy removed from the residual back to the corresponding harmonics in the wide-band pulse model, to conserve energy and improve the harmonic estimation.
Since we would presumably be doing this using a narrow-band processing technique with a fixed hop-size, the times would be different from those of the voice pulses, so we would have to interpolate. We would also have to interpolate the unwrapped phase and compensate for it when adding it to the harmonic pulses. If the residual harmonics vary very fast, on the order of the duration of a few voice pulses or less and by large relative amount, it be better to avoid doing the energy conservation entirely and just remove them from the residual.
Subharmonic ModelingIn the previous section, I discussed techniques for separating harmonic and non-harmonic components within the WBVPM framework. Now, I will discuss a way of modeling subharmonics within that residual.
In a system producing a harmonic sound, there is of course the primary vibrator that is vibrating at the fundamental frequency. However, there may also be other components which vibrate at integer-reciprocals of the fundamental frequency[6]. These components are also harmonic in nature and thus also produce their own harmonics, which are referred to as subharmonics.
Since the sub-fundamental frequency of the subharmonics is the fundamental frequency divided by an integer M, every N*Mth subharmonic actually has the same frequency as the Nth regular harmonic. For example, if M is two, the second subharmonic would have the same frequency as the first harmonic, and the fourth subharmonic would have the same frequency as the second harmonic. If M was instead four, the four subharmonic would have the same frequency as the first harmonic, and the eighth subharmonic would have the same frequency as the second harmonic.
The sub-fundamental frequency could be obtained either by dividing the fundamental frequency at a given time by M, or by running the TWM pitch estimation algorithm on the residual. The presence of significant subharmonics could be detected perhaps by calculating the TWM error for the sub-fundamental frequency, or maybe by measuring the energy of the estimated subharmonics relative to the total residual energy.
Since the subharmonics are also harmonic in nature, we can process them using the WBVPM technique used for the regular harmonics, except now we are applying it to just the residual.
The first step (after fundamental frequency estimation) for WBVPM is calculate the minimally-phased onsets, which is done by the MFPA algorithm. The original MFPA algorithm[1][7] was done using a phase-vocoder with a constant window and hop size. However, I have thought that using a pitch-synchronous analysis instead might improve the accuracy, and thus improve the result of WBVPM too. Additionally, if that is true, using a border interpolation similar to that used in WBVPM also might help.
Since every Mth subharmonic corresponds to a harmonic as well, it will have been already filtered out by the original WBVPM harmonic estimation, and thus will be effectively zero. Assuming the amplitude of the subharmonics varies slowly over frequency, we can obtain a good approximation by calculating an envelope of the subharmonics and interpolating the missing subharmonics. Such an envelope calculation and interpolation is often done via a natural cubic spline.
We actually need to do this at two separate times. We need to do it first in the MFPA, and then again in WBVPM. For MFPA, we can interpolate the amplitude from the amplitude from the estimated amplitude envelope. For the phase, we could set it to be the same as the proceeding subharmonic, or from an interpolation of an estimated unwrapped phase envelope. Another option could be to just ignore the phase differentials from and to the missing subharmonic. However, this may decrease the quality of the MFPA estimation; and furthermore, when M is 2 (which I believe is actually the most common situation), excludes all phase differentials and is thus non-applicable.
We also calculate the missing subharmonics again in WBVPM. For the phase, we can’t just copy the previous subharmonic phase as that could result in significant phase errors around formants. The only reason that could work for MFPA is because we don’t care about the phase values themselves, but the differentials between them; so copying the previous value increases the next differential but sets the current one to zero, which is equivalent to if we had the actual subharmonic phase assuming that phase lies in between the two adjacent known subharmonic phases.
We can calculate the phase for the missing subharmonics again in WBVPM by either the unwrapped phase envelope interpolation, or by creating an artificial phase envelope from the amplitude envelope as described in [1]. The latter wasn’t an option in the MFPA onset determination stage as it requires the window to be centered on the MFPA onset. The latter may also interfere with the ability to extract further residual from the subharmonic WBVPM, depending on which algorithm is used and its implementation.
We can also subtract the estimated missing subharmonics from their corresponding harmonics in the original WBVPM harmonic estimation to improve the estimation of the harmonics. I am not sure what the relation of phase would be in this situation. If we are using the method from [1] to reconstruct the phase from the amplitude for the harmonic WBVPM pulse spectrums, we could just recompute said phase spectrum after the subtraction.
After the subharmonic WBVPM is ran, we can further extract residual using the methods from the previous section. This residual may be yet still processed, for example to separate transients and stochastic residual, and apply separate processing to them; for example, not applying time-scaling and pitch-transposition to the former, and/or modeling the latter as filtered white noise.
SynchronizationIt may be that since we have a separate onset sequence for the harmonic and subharmonic pulses, with the latter being more sparse by a factor of M, that applying transforms to said onset sequences introduces a de-synchronization between the two that produces an unnatural. Perhaps this could solved by assigning each subharmonic pulse to the closest M harmonic pulses, calculating the center of said harmonic pulses as collective, and then calculating the differential between said collective center and the center of the subharmonic pulse. Then, at synthesis, after any onset sequence transforms have been applied, we could find again M closest harmonic pulses in the transformed harmonic pulse sequence, calculate the center of the collective of that group of pulses, and then finally set the time of the subharmonic pulse to said collective center plus the offset that was obtained for the subharmonic pulse at analysis. Perhaps said offset could be scaled by a factor determined from the pitch transposition factor and the time-scaling factor at that time, if either or both of those transforms had been applied.
To reduce jitter from this synchronization process, perhaps we could apply a method similar to approach used for refining the MFPA onsets in [1].
References- 1. Bonada Sanjaume, Jordi. “Voice Processing and Synthesis by Performance Sampling and Spectral Models” 2008, PhD Dissertation, Pompeu Fabra University
- 2. Bonada Sanjaume, Jordi. “Wide-Band Harmonic Sinusoidal Modeling” 2008, International Conference on Digital Audio Effects
- 3. Bonada Sanjaume, Jordi. “Audio Time-Scale Modification in the Context of Professional Audio Post-production” 2002, Pompeu Fabra University
- 4. Bonada Sanjaume, Jordi; Celma, Òscar; Loscos, Àlex; Ortolà, Jaume; Serra, Xavier. “Singing Voice Synthesis Combining Excitation plus Resonance and Sinusoidal plus Residual Models” 2001, International Computer Music Conference
- 5. Bonada Sanjaume, Jordi; Umbert, Martí; Blaauw, Merlijn. “Expressive Singing Synthesis based on Unit Selection for the Singing Synthesis Challenge 2016” 2016, Proceedings of Interspeech 2016
- 6. Loscos, Alex; Bonada Sanjaume, Jordi. “Emulating Rough and Growl Voice in Spectral Domain” 2004, International Conference on Digital Audio Effects
- 7. Bonada Sanjaume, Jordi. “High Quality Voice Transformations Based on Modeling Radiated Voice Pulses in Frequency Domain” 2004, International Conference on Digital Audio Effects










