
Queue Seven M.

Part of wordpress.com

stories
Wide-Band Subharmonic Modeling
Uncategorized, audio-processing, digital-signal-processing, singing-synthesis, vocal-synthesis, vocaloid, voice-processing

Wide-Band Voice Pulse Modeling[1] (also referred to as Wide-Band Harmonic Sinusoidal Modeling[2][5], presumably because it is also applicable to other monophonic harmonic sounds) is a hybrid time-frequency audio processing technique that models each time-domain period of the fundamental frequency in the frequency domain. This is done by calculating the times of the minimally-phased impulses (in the case of the voice, these correspond to glottal closure instants), and then extracting a window centered on each impulse, with its size equal to the period of the fundamental frequency at that instant in time.

One advantage of this technique is that both the time and frequency resolution are better than with the traditional techniques. This is because the harmonics are all integer multiples of the fundamental frequency, so when the window size is the same as the period of the fundamental frequency they correspond exactly to the frequencies of the bins of the Fourier transform (i.e. 0th bin [DC] = 0th harmonic [also DC], 1st bin = 1st harmonic, 2nd bin = 2nd harmonic). In theory, this would result in a perfect estimation of the spectrum. In practice this is not completely true (due to varying fundamental frequency and amplitude, varying timbre, estimation error in the maximally flat phase onset detection, and the input signal being discretely sampled and containing non-harmonic frequencies); however, the error is still much lower than with other techniques.
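As a small illustration of the bin/harmonic correspondence, here is a minimal sketch, assuming an idealised pulse that spans exactly one period and contains only three harmonics. The signal, its period of 64 samples and the helper `dft` are made up for the example and are not part of WBVPM itself.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2)), fine for one short pulse."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

# Hypothetical test pulse: exactly one period (64 samples) containing
# harmonics 1, 2 and 3 with amplitudes 1.0, 0.5 and 0.25.
period = 64
amps = {1: 1.0, 2: 0.5, 3: 0.25}
pulse = [sum(a * math.cos(2 * math.pi * h * t / period) for h, a in amps.items())
         for t in range(period)]

spectrum = dft(pulse)
# Bin k holds harmonic k; scaling by 2/N recovers the cosine amplitude.
bin_amps = [2 * abs(c) / period for c in spectrum[:period // 2]]
```

Because the window length equals the period, all of the energy of each harmonic lands in a single bin and the remaining bins stay at (numerically) zero.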

Minimally-phased impulses as detected by the MFPA algorithm

Another advantage is that, since each window is centered on the minimal phase impulse time, it is also phase-locked and shape-invariant. It does not experience the phase drift that a pitch-synchronous analysis would, because the windows, by definition, are always centered on the corresponding point in the cycle of the fundamental frequency.

The main disadvantage of this technique is that it models harmonics and only harmonics. The non-harmonic components are not lost, however; according to [1], they become represented as rapidly varying noise in the pulse spectrum. I have also been able to observe this experimentally.

Noise in the WBVPM result; frequency increases from the top to the bottom.

We often want to model non-harmonic components differently. For example, we may want to model the stochastic residual as white noise filtered by a filter obtained from an approximation of its smoothed envelope[3] or from an estimation of the vocal tract formants[4]. Another component we might want to process separately is the subharmonics; for example, for the purpose of applying a growl-type effect[6]. Finally, we may want to model transients differently, for example when applying time-scaling and pitch transposition effects[3].

In this post, I will describe several techniques I have devised to modify the WBVPM model to allow the handling of non-harmonic components separately, and also detail a method for processing the subharmonics. I have yet to test any of these techniques.

Residual Separation

Before doing anything else specifically for subharmonics, we first need a method that allows us to process the harmonics using WBVPM and the non-harmonics separately. I thought of a method for doing this some time ago; however, I recently became aware that [5] uses the same overall idea as mine, although their approach is different and they stick to modeling both the harmonic and non-harmonic components within the WBVPM model, whereas my method allows processing the non-harmonic components separately.

The basic idea is that timbre evolves slowly over time, whereas the noise from non-harmonic components evolves very rapidly; actually at the frequency defined as the reciprocal of the distance in time between the different pulses, so it is independent of pitch. In the frequency domain, this corresponds to timbre changes having their energy concentrated mainly at low frequencies and decaying as the frequency increases (where frequency in this context actually corresponds to the time-evolution of voice pulses), and to the noise having its energy concentrated at high frequencies. One unfortunate side-effect of any technique using this principle is that, since low-pitched sounds have pulses that are further apart, the frequency of the noise is lower, while the frequencies arising from timbre changes stay the same, so the gap between the two is smaller. This is especially unfortunate since the other techniques used already perform worse for lower pitches[1], so this compounds the problem.

Any implementation that uses this principle must also take into account the following:

  1. The voice pulse onsets are not necessarily spaced at exactly the same interval. So just using an FFT in which each point corresponds to a step from one voice pulse to the next would introduce error proportional to the deviation of the voice pulse onset times from the ideal sequence, in which there are no deviations and the fundamental frequency is constant.
  2. The voice pulses may overlap or miss areas in the time-domain.
  3. To avoid aliasing, the number of harmonics for each voice pulse varies with the fundamental frequency. So, some harmonics may come in and out of existence.
  4. To improve the estimation of the harmonics, a border interpolation step is applied before applying the FFT for each voice pulse in WBVPM. However, this step only makes sense for harmonics and would presumably interfere significantly with the value of non-harmonic components.

With that said, I have devised two approaches. Both approaches share a common method, which consists of the following steps:

  1. We first solve the issue of varying pulse times by resampling (using a natural cubic spline in my implementation, although other techniques could be used instead) the values for the evolution of an individual harmonic over different voice pulses. The input positions to the interpolation are the times of the voice pulses, whereas the output positions are separated by a fixed interval, thus solving the first issue. An implementation of this may need to take into account aliasing artifacts.
  2. Then, we apply a Fourier transform to the resampled harmonic values. We can then divide the spectrum into a high frequency region and a low frequency region. The high frequency region corresponds to the residual non-harmonic components, and the low frequency region (importantly, including the DC) corresponds to the harmonics. Note that the high frequency component actually corresponds to the residual as if it were a harmonic, and not to the residual directly.
  3. We can then obtain the separated values by reversing the Fourier transform and the resampling process. Some amount of error may be introduced by the resampling (i.e. resampling and reversing the resampling would result in a different signal, even if no transform is applied in between). This can be counteracted by increasing the number of sampling steps (i.e. using a shorter time delta) in the intermediate resampled representation. An improvement that avoids the need for this would be to first calculate the differential between the original evolution of the harmonic property over the pulses and its values after resampling and reverse resampling (without the frequency-domain transform being applied), and then add this differential to the LF data afterwards.
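The three steps above can be sketched roughly as follows. This is only an illustration under several assumptions: piecewise-linear interpolation is used as a simple stand-in for the natural cubic spline, the transform is a naive DFT, and the names `resample`, `dft`, `split_lf_hf`, `n_uniform` and `cutoff_bin` are hypothetical. The differential correction mentioned in step 3 is not included.

```python
import cmath
import math

def resample(xs, ys, new_xs):
    """Piecewise-linear resampling (stand-in for a natural cubic spline).
    Both xs and new_xs must be ascending."""
    out, j = [], 0
    for x in new_xs:
        while j < len(xs) - 2 and xs[j + 1] < x:
            j += 1
        w = (x - xs[j]) / (xs[j + 1] - xs[j])
        out.append((1.0 - w) * ys[j] + w * ys[j + 1])
    return out

def dft(x, sign=-1):
    """Naive DFT (sign=-1) or unnormalised inverse DFT (sign=+1)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def split_lf_hf(pulse_times, values, n_uniform, cutoff_bin):
    """Split one harmonic's per-pulse trajectory into LF and HF parts."""
    t0, t1 = pulse_times[0], pulse_times[-1]
    grid = [t0 + (t1 - t0) * i / (n_uniform - 1) for i in range(n_uniform)]
    uniform = resample(pulse_times, values, grid)       # step 1: fixed interval
    spec = dft(uniform)                                 # step 2: transform
    # Keep DC and the bins up to cutoff_bin (plus their mirrored partners) as LF.
    lf_spec = [c if min(k, n_uniform - k) <= cutoff_bin else 0.0
               for k, c in enumerate(spec)]
    lf_uniform = [c.real / n_uniform for c in dft(lf_spec, sign=+1)]
    hf_uniform = [u - l for u, l in zip(uniform, lf_uniform)]
    lf = resample(grid, lf_uniform, pulse_times)        # step 3: back to pulse times
    hf = resample(grid, hf_uniform, pulse_times)
    return lf, hf

# Hypothetical trajectory: a constant harmonic amplitude of 1.0 plus noise
# that flips sign on every pulse (the fastest possible variation).
pulse_times = [float(i) for i in range(16)]
values = [1.0 + 0.1 * (-1) ** i for i in range(16)]
lf, hf = split_lf_hf(pulse_times, values, n_uniform=16, cutoff_bin=2)
```

In this toy case the pulse times are already evenly spaced, so the resampling is lossless and the LF part recovers the constant amplitude while the HF part recovers the alternating noise. With jittered pulse times, `lf` and `hf` would no longer sum exactly to `values`, which is the resampling error discussed in step 3.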

Now the two approaches to it are as follows:

LF and HF approach

In this approach, we actually compute two different versions of the voice pulse in WBVPM. For one version, we apply border interpolation; and for the other, we do not. We then apply the aforementioned method to both, and keep only the LF result from the version with border interpolation, and only the HF result from the one without border interpolation. This lets us overcome the fourth issue.

For the third issue, we can interpolate the closest harmonics for the missing harmonic, and apply a decay based on the distance to the nearest non-missing harmonic. For the phase, we need to interpolate it in such a way that it is continuous over both frequency and time. [7] proposes a method for accomplishing this. Another option would be to interpolate the whole amplitude spectrum and then reconstruct an artificial phase spectrum from it using the technique described in [1]. I got this idea from [5]. However, this phase reconstruction technique is only applicable to maximally flat phased harmonics, so it could only be used for the LF border-interpolated version.

Finally, we take the HF result and synthesize it using the WBVPM synthesis method, or a modified version of it, to obtain a time-domain representation of the residual which can then be further processed by other techniques.

This approach has several issues. One is that it does not handle issue #2. When we are dealing with harmonic pulses, those missing values can be reconstructed very well using a technique I proposed in this previous post, or just ignored entirely as in the original WBVPM method. In fact, these two techniques are equivalent for harmonics if the fundamental frequency remains stationary and both it and the MFPA onsets are estimated perfectly. However, neither of these applies to non-harmonic components.

Another issue is that of phase. For the LF harmonic result, we can just apply the same method to the phase unwrapped over frequency and time (using the method from [1]) and keep only the LF. However, we need to do something different for HF non-harmonic components. One idea I came up with would be to calculate LF and HF unwrapped phase results for the version of the pulse without border interpolation; then calculate the difference between the original phase and the LF phase and divide it by the ratio of the difference between the original amplitude and the LF amplitude to the original amplitude; and finally apply princarg to put the phase in the wrapped domain [-pi, +pi). However, I am not confident that this would work well.
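For reference, the princarg step on its own can be written as a one-liner. This is a minimal sketch of the usual principal-argument wrapping into [-pi, +pi); it does not attempt the amplitude-ratio scaling described above, which remains untested.

```python
import math

def princarg(phase):
    """Wrap a phase in radians into the interval [-pi, +pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi
```

For example, a phase of 3*pi/2 wraps to -pi/2, and +pi itself maps to -pi because the interval is half-open.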

LF and Residual approach

Another approach would be to calculate only the LF border-interpolated pulse, synthesize it in the time-domain using the normal WBVPM synthesis approach, and then compute the difference between the input signal and this synthesized signal. This difference is then our time-domain residual. Issue #3 can be solved in the same way as in the previous approach.

This has the advantage of solving issues #2 and #4 implicitly and also being simpler. Another seeming advantage is that the sum of the residual and harmonics is exactly equal to the input signal; however, this may actually be the only disadvantage of this approach. The reconstructed signal in WBVPM is slightly different from the original signal, and that difference is now included in the residual, which may result in undesired components.

Actually, two variants of this technique exist. In one, we compute the LF of the amplitude and the unwrapped phase separately before separating the residual. In the other, we calculate just the amplitude and reconstruct the phase from the amplitude using the phase-from-amplitude approach before separating the residual. The phase reconstruction is done implicitly after the residual separation for both approaches, so long as we are using the Excitation plus Resonance model. I am not certain which would be better, but I lean towards the former as I feel like the latter would introduce more unwanted artifacts into the residual.

Another possible approach would be to do the opposite – instead of constructing HF as the residual from LF, taking LF as the residual from the unborder-interpolated HF. I feel like this is very unlikely to be the best technique though.

DC Component

In theory, the DC component should remain constant between voice pulses. However, I have noticed that, in practice, it actually varies quite significantly between subsequent pulses. This is presumably mainly due to subharmonics and, to a lesser degree, other non-harmonic components. In the first approach, we could perhaps transfer the whole DC component from the LF harmonic pulse to the HF non-harmonic pseudo-pulse. The equivalent for the second approach would be to just set the DC to zero in each pulse, or to its average over the whole audio signal. However, DC might also arise due to fundamental frequency estimation errors. I am not sure how effective the border interpolation method is for counteracting this. Perhaps an additional step would be needed for that specifically.

Residual Harmonics

In [7], the author noted that for the technique being experimented with, Narrow-Band Voice Pulse Modeling, which directly allowed the separation of harmonic and non-harmonic components, there were some traces of the harmonics present in the residual. A filter was used to attenuate the frequencies around the harmonics in the residual. I am not sure whether this would also be a significant issue for my WBVPM separation approach. However, if it is, we could presumably do something similar. We could also add the energy removed from the residual back to the corresponding harmonics in the wide-band pulse model, to conserve energy and improve the harmonic estimation.

Since we would presumably be doing this using a narrow-band processing technique with a fixed hop-size, the times would be different from those of the voice pulses, so we would have to interpolate. We would also have to interpolate the unwrapped phase and compensate for it when adding it to the harmonic pulses. If the residual harmonics vary very fast, on the order of the duration of a few voice pulses or less and by a large relative amount, it may be better to avoid doing the energy conservation entirely and just remove them from the residual.

Subharmonic Modeling

In the previous section, I discussed techniques for separating harmonic and non-harmonic components within the WBVPM framework. Now, I will discuss a way of modeling subharmonics within that residual.

In a system producing a harmonic sound, there is of course the primary vibrator that is vibrating at the fundamental frequency. However, there may also be other components which vibrate at integer-reciprocals of the fundamental frequency[6]. These components are also harmonic in nature and thus also produce their own harmonics, which are referred to as subharmonics.

Since the sub-fundamental frequency of the subharmonics is the fundamental frequency divided by an integer M, every N*Mth subharmonic actually has the same frequency as the Nth regular harmonic. For example, if M is two, the second subharmonic would have the same frequency as the first harmonic, and the fourth subharmonic would have the same frequency as the second harmonic. If M were instead four, the fourth subharmonic would have the same frequency as the first harmonic, and the eighth subharmonic would have the same frequency as the second harmonic.
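The index arithmetic can be checked with a few lines of code. This is just a sketch; the function name `coinciding_harmonic` and the 220 Hz fundamental are made up for the example.

```python
def coinciding_harmonic(subharmonic_index, m):
    """Return N if this subharmonic coincides with the Nth harmonic, else None."""
    if subharmonic_index % m == 0:
        return subharmonic_index // m
    return None

f0 = 220.0          # hypothetical fundamental frequency in Hz
m = 2               # subharmonics at integer multiples of f0 / 2
sub_f0 = f0 / m
# (subharmonic index, frequency in Hz, coinciding harmonic or None)
table = [(k, k * sub_f0, coinciding_harmonic(k, m)) for k in range(1, 7)]
```

With M = 2, only the odd-numbered subharmonics (110 Hz, 330 Hz, 550 Hz, ...) fall between the regular harmonics; the even-numbered ones land on top of them.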

The sub-fundamental frequency could be obtained either by dividing the fundamental frequency at a given time by M, or by running the TWM pitch estimation algorithm on the residual. The presence of significant subharmonics could be detected perhaps by calculating the TWM error for the sub-fundamental frequency, or maybe by measuring the energy of the estimated subharmonics relative to the total residual energy.

Since the subharmonics are also harmonic in nature, we can process them using the WBVPM technique used for the regular harmonics, except now we are applying it to just the residual.

The first step (after fundamental frequency estimation) for WBVPM is to calculate the minimally-phased onsets, which is done by the MFPA algorithm. The original MFPA algorithm[1][7] was done using a phase-vocoder with a constant window and hop size. However, I have thought that using a pitch-synchronous analysis instead might improve the accuracy, and thus improve the result of WBVPM too. Additionally, if that is true, using a border interpolation similar to that used in WBVPM might also help.

Since every Mth subharmonic corresponds to a harmonic as well, it will have been already filtered out by the original WBVPM harmonic estimation, and thus will be effectively zero. Assuming the amplitude of the subharmonics varies slowly over frequency, we can obtain a good approximation by calculating an envelope of the subharmonics and interpolating the missing subharmonics. Such an envelope calculation and interpolation is often done via a natural cubic spline.
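As a rough sketch of filling in the missing subharmonic amplitudes, the following uses the midpoint of the two neighbouring subharmonics. That is a deliberately crude stand-in for the natural cubic spline envelope mentioned above, and the function name and sample values are hypothetical.

```python
def fill_missing_subharmonics(amps, m):
    """Fill every m-th subharmonic amplitude from its neighbours.

    amps[k - 1] is the amplitude of the k-th subharmonic. Entries whose
    1-based index is a multiple of m are treated as missing and replaced by
    the midpoint of the neighbours (or the lower neighbour at the top end).
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    filled = list(amps)
    for k in range(m, len(amps) + 1, m):      # k is the 1-based index
        below = amps[k - 2]                   # subharmonic k - 1
        above = amps[k] if k < len(amps) else below   # subharmonic k + 1
        filled[k - 1] = 0.5 * (below + above)
    return filled

# Hypothetical measured amplitudes with M = 2: the 2nd and 4th subharmonics
# were absorbed by the harmonic estimation and read as zero.
measured = [1.0, 0.0, 0.6, 0.0, 0.2]
filled = fill_missing_subharmonics(measured, 2)
```

For M of at least 2 the immediate neighbours of a missing subharmonic are never themselves missing, which is why a simple neighbour lookup is enough here; a spline fitted through all known subharmonics would give a smoother envelope.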

We actually need to do this at two separate times: first in the MFPA, and then again in WBVPM. For MFPA, we can interpolate the amplitude from the estimated amplitude envelope. For the phase, we could set it to be the same as that of the preceding subharmonic, or take it from an interpolation of an estimated unwrapped phase envelope. Another option could be to just ignore the phase differentials from and to the missing subharmonic. However, this may decrease the quality of the MFPA estimation; furthermore, when M is 2 (which I believe is actually the most common situation), it excludes all phase differentials and is thus not applicable.

We also calculate the missing subharmonics again in WBVPM. For the phase, we can’t just copy the previous subharmonic phase as that could result in significant phase errors around formants. The only reason that could work for MFPA is because we don’t care about the phase values themselves, but the differentials between them; so copying the previous value increases the next differential but sets the current one to zero, which is equivalent to if we had the actual subharmonic phase assuming that phase lies in between the two adjacent known subharmonic phases.

We can calculate the phase for the missing subharmonics again in WBVPM by either the unwrapped phase envelope interpolation, or by creating an artificial phase envelope from the amplitude envelope as described in [1]. The latter wasn’t an option in the MFPA onset determination stage as it requires the window to be centered on the MFPA onset. The latter may also interfere with the ability to extract further residual from the subharmonic WBVPM, depending on which algorithm is used and its implementation.

We can also subtract the estimated missing subharmonics from their corresponding harmonics in the original WBVPM harmonic estimation to improve the estimation of the harmonics. I am not sure what the relation of phase would be in this situation. If we are using the method from [1] to reconstruct the phase from the amplitude for the harmonic WBVPM pulse spectrums, we could just recompute said phase spectrum after the subtraction.

After the subharmonic WBVPM is run, we can further extract residual using the methods from the previous section. This residual may then be processed further, for example to separate transients and stochastic residual and apply separate processing to them; for example, not applying time-scaling and pitch-transposition to the former, and/or modeling the latter as filtered white noise.

Synchronization

It may be that, since we have a separate onset sequence for the harmonic and subharmonic pulses, with the latter being more sparse by a factor of M, applying transforms to said onset sequences introduces a de-synchronization between the two that produces an unnatural sound. Perhaps this could be solved by assigning each subharmonic pulse to the closest M harmonic pulses, calculating the center of said harmonic pulses as a collective, and then calculating the differential between said collective center and the center of the subharmonic pulse. Then, at synthesis, after any onset sequence transforms have been applied, we could again find the M closest harmonic pulses in the transformed harmonic pulse sequence, calculate the center of the collective of that group of pulses, and then finally set the time of the subharmonic pulse to said collective center plus the offset that was obtained for the subharmonic pulse at analysis. Perhaps said offset could be scaled by a factor determined from the pitch transposition factor and the time-scaling factor at that time, if either or both of those transforms had been applied.
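A minimal sketch of this idea might look like the following. It assumes the "center of the collective" is simply the mean onset time of the M closest harmonic pulses, and all function names, the `scale` parameter and the example times are hypothetical; like the idea itself, it is untested on real signals.

```python
def collective_center(harmonic_times, t, m):
    """Mean onset time of the m harmonic pulses closest to time t."""
    closest = sorted(harmonic_times, key=lambda h: abs(h - t))[:m]
    return sum(closest) / m

def analysis_offset(harmonic_times, sub_time, m):
    """Offset of a subharmonic pulse from its collective center at analysis."""
    return sub_time - collective_center(harmonic_times, sub_time, m)

def resync(transformed_harmonic_times, transformed_sub_time, offset, m, scale=1.0):
    """Re-place a subharmonic pulse relative to the transformed harmonic pulses."""
    center = collective_center(transformed_harmonic_times, transformed_sub_time, m)
    return center + scale * offset

# Analysis: harmonic onsets one time unit apart, subharmonic pulse at 0.6.
offset = analysis_offset([0.0, 1.0, 2.0, 3.0], 0.6, m=2)
# Synthesis after a hypothetical 2x time stretch of the onset sequence.
new_time = resync([0.0, 2.0, 4.0, 6.0], 1.2, offset, m=2)
stretched = resync([0.0, 2.0, 4.0, 6.0], 1.2, offset, m=2, scale=2.0)
```

With `scale=1.0` the subharmonic pulse keeps its absolute offset from the group of harmonic pulses; passing the time-scaling factor as `scale` keeps the relative position instead. Which of the two sounds more natural is an open question.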

To reduce jitter from this synchronization process, perhaps we could apply a method similar to the approach used for refining the MFPA onsets in [1].

References
  • 1. Bonada Sanjaume, Jordi. “Voice Processing and Synthesis by Performance Sampling and Spectral Models” 2008, PhD Dissertation, Pompeu Fabra University
  • 2. Bonada Sanjaume, Jordi. “Wide-Band Harmonic Sinusoidal Modeling” 2008, International Conference on Digital Audio Effects
  • 3. Bonada Sanjaume, Jordi. “Audio Time-Scale Modification in the Context of Professional Audio Post-production” 2002, Pompeu Fabra University
  • 4. Bonada Sanjaume, Jordi; Celma, Òscar; Loscos, Àlex; Ortolà, Jaume; Serra, Xavier. “Singing Voice Synthesis Combining Excitation plus Resonance and Sinusoidal plus Residual Models” 2001, International Computer Music Conference
  • 5. Bonada Sanjaume, Jordi; Umbert, Martí; Blaauw, Merlijn. “Expressive Singing Synthesis based on Unit Selection for the Singing Synthesis Challenge 2016” 2016, Proceedings of Interspeech 2016
  • 6. Loscos, Alex; Bonada Sanjaume, Jordi. “Emulating Rough and Growl Voice in Spectral Domain” 2004, International Conference on Digital Audio Effects
  • 7. Bonada Sanjaume, Jordi. “High Quality Voice Transformations Based on Modeling Radiated Voice Pulses in Frequency Domain” 2004, International Conference on Digital Audio Effects

queuesevenm
http://queuesevenm.wordpress.com/?p=80
Extensions
Three Improvements to Wide-Band Voice Pulse Modeling
Uncategorized, audio-processing, digitial-signal-processing, vocal-synthesis, wide-band-voice-pulse-modeling

In the last post about my vocal synthesis project, I talked about implementing the Wide-Band Voice Pulse Modeling algorithm. Since then, I’ve actually done some original research of my own and have devised what I believe to be three minor improvements to the algorithm.

Wide-Band Voice Pulse Modeling is, as the name would suggest, an algorithm that models a voice sound as a sequence of voice pulses. Broadly speaking, in digital audio processing, there are two ways of processing the signal. One is to process it in the time-domain – that is, the evolution of pressure over time. The other is in the frequency domain – typically by cutting the signal into small chunks called analysis windows and applying the Fourier transform to each one to get an amplitude and phase spectrum.

Both of these techniques suffer from their own problems. Many sounds we want to process are harmonic sounds, meaning that they are composed of a set of frequencies which are integer multiples of the fundamental frequency. There is also typically a small non-harmonic component, which is referred to as the residual.

One problem with the frequency domain approaches is the issue of phasiness. Many types of harmonic sounds (including the human voice) can be thought of as a source and a set of formants, which arise from resonances. Formants appear as accentuations and attenuations in the frequency domain, but also as shifts in phase. At the start of each cycle of the fundamental frequency, the phase of each harmonic is roughly the same as that of the previous harmonic, except in the transitions between formants, where it shifts.

This property of the phases results in each cycle of the waveform having a characteristic ‘shape’ in the time-domain, and modeling this shape accurately is essential for a natural-sounding result when processing.

Typical frequency domain techniques do not preserve this shape because the basis of their transforms is usually not at the flat phase onsets, and thus the transforms will not be modifying the actual phase spectrum, but an offset one. This offset phase spectrum is only correct for the original frequencies and times of the pulse onsets; if any transforms are applied, it becomes incorrect and the resulting processed signal does not have the correct shape.

One approach for processing audio in the time domain is Time-Domain Pitch Synchronous Overlap and Add (TD-PSOLA). This technique works by detecting the fundamental frequency and cutting the signal up into segments of its period. Those segments can then be scaled or repeated before they are put back together.

Wide-Band Voice Pulse Modeling (from Dr. Jordi Bonada’s PhD thesis) is a hybrid time-frequency domain technique that is shape-invariant (i.e. it preserves the shape of each cycle/pulse), but has the advantage that frequency-domain transforms can be used.

It works by detecting the flat phase onsets and then extracting a window centered on each onset, with its size being the period of the fundamental frequency. Each of these windows is then run through a discrete Fourier transform. As each window represents one cycle of the fundamental frequency, each bin in the spectrum represents an integer multiple of it (i.e. the harmonics). Before the discrete Fourier transform is applied, a trapezoidal window function is applied to the window. It is one for most of the window; towards the edges, it decreases linearly until it reaches 0.5 at the edge of the window. Extensions with the same size as the border interpolation are extracted right outside of the window and are faded in a similar way. They are then added to the opposite edge within the window (so that they are one cycle apart). Since the two should be the same in the ideal case, this transform then has no effect. But when there are estimation errors, the errors cancel out towards the edges, which significantly improves the accuracy of the phase estimation.
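The border interpolation can be sketched as below. This is a simplified illustration, not the thesis implementation: it assumes an integer period in samples, takes the window start index directly instead of an onset, evaluates the trapezoid at sample centers, and skips the upsampling stage. The name `fold_borders` and the test signal are made up.

```python
import math

def fold_borders(signal, start, period, border):
    """Return one pulse of `period` samples with border interpolation applied.

    Inside the window the weight is 1, falling linearly to 0.5 at each edge.
    The extension just outside each edge is faded from 0.5 down to 0 and
    added to the opposite edge, one period away.
    """
    if 2 * border > period:
        raise ValueError("border must be at most half the period")
    pulse = []
    for i in range(period):
        if i < border:                               # near the left edge
            w = 0.5 + 0.5 * (i + 0.5) / border
            partner = signal[start + i + period]     # from the right extension
        elif i >= period - border:                   # near the right edge
            w = 0.5 + 0.5 * (period - i - 0.5) / border
            partner = signal[start + i - period]     # from the left extension
        else:                                        # flat part of the trapezoid
            w, partner = 1.0, 0.0
        pulse.append(w * signal[start + i] + (1.0 - w) * partner)
    return pulse

# Hypothetical, perfectly periodic input: folding should change nothing.
period, border, start = 50, 5, 60
signal = [math.sin(2 * math.pi * n / period) for n in range(200)]
pulse = fold_borders(signal, start, period, border)
```

For this ideal input the folded pulse matches the plain one-period cut, which is the "no effect in the ideal case" property described above. With a mis-estimated period, the two samples being averaged differ, and the average pulls both edges of the pulse towards a common value.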

I implemented the Wide-Band Voice Pulse Modeling algorithm via the upsampling method (specifically, upsampling via a natural cubic spline). There are actually two methods proposed in that paper, the other being via periodization. There is actually a patent that pertains to WBVPM, but it only covers the periodization version (which is what they used for their results), so I have implemented the upsampling method instead. I have been able to validate the main results in that paper; specifically, its shape-invariance and lower residual when compared to other methods. Furthermore, I have devised three significant improvements to the algorithm – two of which are only possible because I used the spline approach, so in a sense it was good that I had to do it that way.

Of the three improvements, I have implemented the first two and shown their advantage over the original WBVPM algorithm. The resulting score has been obtained by taking the mean of the relative residual level (i.e. the difference between the original and reconstructed signal, relative to the level of the original signal). I have done so on an audio sample that deliberately exhibits traits that were noted as negatively affecting the WBVPM algorithm’s resulting quality: notably, a low-pitched voice with rapid and deep vibrato, transients, strong amplitude modulation, and a large portion of the sample being a voiced/unvoiced/voiced transition.

First I should note that my WBVPM implementation is currently far from optimal. The pitch estimation system (via the modified TWM algorithm) has not undergone testing and tuning of its parameters, and there are many variations of the TWM algorithm to consider. Additionally, I have not implemented unvoiced/voiced detection (because, as far as I can tell, it is not mentioned in Bonada’s thesis; presumably it’s in prior literature, but I have not researched it yet), so all the algorithms act as if they are always processing a voiced signal even when they are not.

RESILIENT BORDER INTERPOLATION IN SYNTHESIS – When I first implemented the synthesis step for WBVPM, it was late at night and I was tired. I wanted a quick result before I went to bed and didn’t understand the wording of the description of the synthesis step in WBVPM. As such, my original implementation differed significantly. Instead of using overlap-and-add, it found, for each sample, the closest voice pulse and determined its value for that time, taking advantage of the spline that was generated for downsampling and using the periodic nature of the pulse to extend it when the sample was beyond its domain (i.e. the opposite of overlapping). This approach led to high-frequency crackling artifacts due to discontinuities at the voice pulse boundaries.

The following day, I properly understood the synthesis approach and rewrote the synthesis code. Interestingly, this actually gave worse overall results. While the high frequency artifacts were gone, there were now large low frequency artifacts that appeared as large modulations in the time-domain. I eventually tracked this down to being a bug in my implementation of the MFPA algorithm that sometimes resulted in massive errors of up to 1.5 radians. I fixed this bug and the reconstruction synthesis no longer had significant artifacts, but I thought it was interesting that my approach, despite having the discontinuity issue, was more resilient to errors in the MFPA estimation. I began thinking if the two approaches could be combined to create an even better approach.

I was thinking about why the modulation occurred in the case of the overlap-and-add method. Thinking about it, when the fundamental frequency is stationary and the MFPA onsets are perfect, the trapezoidal window function is equivalent to a weighted average between two adjacent voice pulses over the duration of twice the border interpolation size. However, when the MFPA onsets are inaccurate, or even just when the fundamental frequency is non-stationary, this is no longer true. Even worse, thinking about it from the weighted average point of view, the sum isn’t necessarily one everywhere anymore, hence the modulation.
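The weighted-average view can be illustrated numerically. The sketch below only evaluates the trapezoidal weights of neighbouring pulses at a given time, assuming a constant period; the function names, the period of 100 samples and the border of 10 samples are hypothetical.

```python
def trapezoid_weight(t, center, period, border):
    """Trapezoidal weight at time t: 1 inside the window, 0.5 at its edges,
    fading to 0 at `border` samples beyond the edges."""
    d = period / 2 - abs(t - center)      # distance inside (+) or outside (-) the edge
    return min(1.0, max(0.0, 0.5 + 0.5 * d / border))

def weight_sum(t, onsets, period, border):
    """Sum of the overlapping trapezoidal weights of all pulses at time t."""
    return sum(trapezoid_weight(t, c, period, border) for c in onsets)

period, border = 100, 10
ideal = weight_sum(45, [0, 100, 200], period, border)      # perfectly spaced onsets
jittered = weight_sum(50, [0, 104, 200], period, border)   # middle onset 4 samples late
```

With perfectly spaced onsets the weights sum to one everywhere, whereas shifting one onset by four samples leaves a dip to 0.8 at the boundary between the first two pulses, which shows up as amplitude modulation in the overlap-and-add output.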

I then devised a method that would not result in modulation. This method works by first synthesizing the ‘inner’ portion of each pulse (by ‘inner’, I mean starting at the end of the border interpolation at the start of the pulse, and ending before the start of the next border interpolation towards the end of the pulse). Then, for the gaps in between each pulse, we calculate each sample value as a weighted average of two values. These values are the values of each of the two adjacent voice pulses at that time. Since the gap extends beyond the boundaries of each voice pulse, we use the periodic nature of the pulses to compute the effective position in the voice pulse by taking the position modulo the period of the fundamental frequency at that voice pulse. The fundamental frequencies of the two voice pulses may differ, so we actually change the step in time linearly. At each end of the gap, the step size for the voice pulse that end is next to is one sample in time. At the end of the gap, the step for the former voice pulse is the equivalent of one sample in the latter voice pulse relative to the former’s fundamental frequency (e.g. if the second voice pulse has twice the fundamental frequency of the first, the step size for the first would be 2 and the step size for the second would be 1 at the end of the gap). For the start of the gap, it is the same except relative to the first pulse having a step of 1. In between, we interpolate the step size linearly.

It is worth noting that in the ideal case where the onsets are exactly correct and the fundamental frequency is stationary, the result of this approach is the same as using the trapezoidal window.
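A minimal sketch of the gap interpolation described above, under some loud assumptions: each pulse is stored as exactly one period, fractional positions are read with linear interpolation rather than a spline, and the names `sample_periodic` and `fill_gap` and their parameters are hypothetical. The point it illustrates is that the two weights always sum to one, so a constant input stays constant.

```python
import math

def sample_periodic(pulse, pos):
    """Read one stored period at a fractional position, wrapping modulo the period."""
    n = len(pulse)
    pos %= n
    i = int(math.floor(pos))
    frac = pos - i
    return (1.0 - frac) * pulse[i % n] + frac * pulse[(i + 1) % n]

def fill_gap(pulse_a, pulse_b, start_a, end_b, gap_len):
    """Cross-fade from pulse_a to pulse_b over gap_len samples.

    start_a: read position in pulse_a for the first gap sample.
    end_b:   read position in pulse_b for the last gap sample.
    """
    ratio = len(pulse_a) / len(pulse_b)  # period of a relative to period of b
    # Step sizes change linearly: (1, 1/ratio) at the start, (ratio, 1) at the end.
    steps_a, steps_b = [], []
    for i in range(gap_len):
        t = i / (gap_len - 1) if gap_len > 1 else 0.0
        steps_a.append(1.0 + t * (ratio - 1.0))
        steps_b.append(1.0 / ratio + t * (1.0 - 1.0 / ratio))
    # Walk forward through pulse_a, and backward from the known end point in pulse_b.
    pos_a = [start_a]
    for i in range(1, gap_len):
        pos_a.append(pos_a[-1] + steps_a[i])
    pos_b = [end_b]
    for i in range(gap_len - 1, 0, -1):
        pos_b.append(pos_b[-1] - steps_b[i])
    pos_b.reverse()
    out = []
    for i in range(gap_len):
        w = (i + 1) / (gap_len + 1)  # weight of pulse_b; weight of pulse_a is 1 - w
        a = sample_periodic(pulse_a, pos_a[i])
        b = sample_periodic(pulse_b, pos_b[i])
        out.append((1.0 - w) * a + w * b)
    return out
```

With two identical pulses of equal period and consistent positions, the output is simply the periodic extension of the pulse, which matches the ideal-case behaviour noted above.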

FREQUENCY WARP-CORRECTION – As noted in Bonada’s thesis, WBVPM assumes that the fundamental frequency is stationary within each pulse. This is not actually true, and the resulting artifacts are particularly apparent for voice signals with a low fundamental frequency, because each period of the signal is longer in time and thus the internal state of the system has more time to change.

One of the changes that can happen over time is modulation of the fundamental frequency. This can be thought of as a time-domain remapping function that distorts each voice pulse according to a continuous fundamental frequency trajectory.

I discovered a way of correcting this, largely by accident, while thinking about the modulation issue discussed in the previous section, and in particular about how I proposed changing the step size linearly in the gaps between the ‘inner’ pulses. We have a discrete sequence of fundamental frequencies. So, what if instead of changing the step size linearly, we created a spline from the fundamental frequencies and changed the step size based on that? Then I realized that we could also use this for the whole voice pulses and just sample everything with a step size based on the fundamental frequency trajectory. This would act like the distortion from changing parameters within each voice pulse, at least in the synthesis stage. Furthermore, since we are already computing splines for each voice pulse to downsample it, this comes at very little additional computational cost.

However, the voice pulses in analysis are already distorted. I then realized that we can do the inverse resampling in the upsampling stage of WBVPM analysis to correct for the non-stationary frequency; the pulse is then redistorted according to the transformed fundamental frequency trajectory in the synthesis stage. This makes the method effectively invariant to modulations in fundamental frequency, so long as the modulation frequency is less than the fundamental frequency and the modulation is modeled well by the spline, which should be the case when the modulation period is at least several voice pulses.
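As a rough illustration of sampling with a step size that follows the fundamental frequency trajectory, the sketch below turns a per-sample f0 track into read positions inside a pulse stored at a reference f0. The function name `warped_positions`, the reference-f0 convention, and the idea of passing an already evaluated f0 track (instead of the spline itself) are assumptions made for the example.

```python
def warped_positions(f0_track, f0_ref):
    """Read positions (in samples of a pulse stored at f0_ref) for each output sample.

    f0_track holds the instantaneous f0 at each output sample, e.g. evaluated
    from a spline through the per-pulse f0 values (the spline is omitted here).
    """
    positions = []
    pos = 0.0
    for f0 in f0_track:
        positions.append(pos)
        # A step above 1 reads the pulse faster (higher pitch), below 1 slower.
        pos += f0 / f0_ref
    return positions
```

A constant track equal to `f0_ref` gives unit steps, i.e. no warping. The analysis-side correction would apply the inverse of this mapping, which is not shown here.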

PITCHED/UNPITCHED DECOMPOSITION – As mentioned in Bonada’s thesis, WBVPM only models sinusoids, and thus residual in the input signal is encoded as fluctuations between the spectra of voice pulses. I devised a post-processing technique to separate the voice pulses into sinusoidal and residual components.

I have not actually implemented and tested this yet because, as it is a post-processing step, it will not improve the residual level and, in fact, will probably make it worse. The benefit of it is in the transformation stage, which is much harder to quantify and which I have not finished implementing. However, I believe this approach should work.

The technique works as follows:
a) First, for each voice pulse, and then for each harmonic of its spectrum, we compute a spline based on the values of the amplitude of that harmonic in the voice pulse as well as a fixed number of surrounding voice pulses.
b) Since the time delta between voice pulses can vary, we then resample each local harmonic spline with fixed steps in time.
c) We compute the Fourier transform of these resampled local harmonic trajectories.
d) We apply a low-pass and a high-pass filter to separate each trajectory into low-frequency and high-frequency components.
e) We then apply the inverse Fourier transform to each of these. We can then sample the low-pass trajectory at the time of the voice pulse to get the amplitude value of the denoised harmonic for that voice pulse. The same can be done for the high-pass trajectory to obtain a pseudo-pulse representing the residual. These residual voice pulses can then be synthesized using the WBVPM synthesis method to obtain a time-domain residual signal which can be processed separately from the main harmonic signal.
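Steps b) to e) can be sketched for a single harmonic trajectory as follows. This is untested against real voice data and makes several assumptions: linear interpolation stands in for the spline, a plain O(n^2) DFT is used so the example has no dependencies, the filters are ideal brick-wall filters with a hypothetical `cutoff_bin` parameter, and all function names are made up for illustration.

```python
import cmath

def resample_uniform(times, values, step):
    """Resample an irregular trajectory onto a fixed time step (needs >= 2 points)."""
    count = int((times[-1] - times[0]) / step + 1e-9) + 1
    out = []
    j = 0
    for i in range(count):
        t = times[0] + i * step
        # Advance to the segment [times[j], times[j + 1]] that contains t.
        while j + 2 < len(times) and times[j + 1] < t:
            j += 1
        w = (t - times[j]) / (times[j + 1] - times[j])
        out.append((1.0 - w) * values[j] + w * values[j + 1])
    return out

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def split_trajectory(samples, cutoff_bin):
    """Split a uniformly sampled trajectory into low-pass and high-pass parts."""
    spectrum = dft(samples)
    n = len(spectrum)
    low = [0j] * n
    high = [0j] * n
    for k in range(n):
        # Bins k and n - k hold the same frequency, so they are kept together.
        if min(k, n - k) <= cutoff_bin:
            low[k] = spectrum[k]
        else:
            high[k] = spectrum[k]
    return idft(low), idft(high)
```

Because every bin goes to exactly one of the two parts, the low-pass and high-pass trajectories sum back to the resampled input, so nothing is lost in the split itself; any loss comes from the resampling step.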

A significant source of error in this process would presumably come from the resampling step. This can be decreased by using a smaller time step, at an increased computational cost. However, the error could probably be greatly reduced by first calculating the difference between the original amplitudes and the amplitudes at the same times in a spline computed from the resampled harmonic spline, before applying the band filters; this difference can later be added back to the low-pass amplitude trajectory.

The denoised harmonic phase can also be computed via the same method, using Bonada’s method for unwrapping phase across both frequency and time. The residual phase can be calculated by taking the difference of the original phase from the denoised phase and dividing it by the residual amplitude.

Results

I have tested the first two improvements and obtained the following results for the aforementioned audio sample:

Original WBVPM: -36.355dB
Warp-correction improvement only: -36.74595dB
Warp-correction & Resilient border interpolation in synthesis: -37.41177dB

More research is needed to properly evaluate these improvements across more samples with more variety, and to see if these techniques still result in improvements with more accurate pitch and MFPA estimation and with proper handling of unvoiced/voiced frames.
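For reference, one common way to express a residual level like the figures above is the energy of the difference signal relative to the energy of the original, in decibels. The post does not state exactly how the numbers were measured, so the definition and the name `residual_db` below are assumptions, not a description of the actual test setup.

```python
import math

def residual_db(original, reconstructed):
    """Energy of (original - reconstructed) relative to the original, in dB."""
    err = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    ref = sum(o ** 2 for o in original)
    return 10.0 * math.log10(err / ref)
```

Under this definition, a reconstruction whose error is one tenth of the signal amplitude everywhere scores -20dB, and lower (more negative) values are better.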

queuesevenm
http://queuesevenm.wordpress.com/?p=59
Extensions
Shape-invariant transforms using Wide-Band Voice Pulse Modeling
Uncategorizeddspsignal-processingsinging-synthesissynthesissynthesizervocal-synthesisvocaloid

I have been working on a project for some time to perform singing synthesis using the same method as the “VOCALOID” software. I’ve been following the paper that was the culmination of the research that VOCALOID was based on, Voice Processing and Synthesis by Performance Sampling and Spectral Models.

After quite some work, I have achieved a major milestone, a shape-invariant pitch transposition.

First the original audio: https://files.catbox.moe/zmt3rr.wav
Now my version with WBVPM (pitched down by an octave): https://voca.ro/1mJ5qljrp9hD

And a version using a naive pitch shift: https://files.catbox.moe/xs39bq.wav

Notice that my version, while having more noise, sounds more natural and has less phasiness. This is particularly noticeable if you play both at very low volume. One sounds much more ‘human’ than the other.

Also note that this is an extreme example with an octave shift (or 1200 cents); in practice, shifts would typically be far less. Also, this doesn’t implement several other parts of the system (more on that later).

I’ll explain all of this in a moment, but first, I’d like to give some relevant bibliographical details.

Bibliography

For a long time, I had thought that VOCALOID1 used Narrow-Band Voice Pulse Modeling while VOCALOID2 and onwards used Wide-Band Voice Pulse Modeling. This was incorrect, and additionally it was the source of most of my confusion surrounding the paper.

What actually happened is that the research technology that would later become VOCALOID1 started out as work to improve the existing Spectral Modeling Synthesis (SMS) system that had been developed in the early 1990s. Work on this improvement began in the late 1990s. This system evolved, and techniques from it were combined with techniques from a system under development called a Phase-Locked Vocoder (PLVC); the result would be released as VOCALOID1. In the mid-2000s, work began on combining the techniques learned from improving SMS and the PLVC-based system with the much older and well-known TD-PSOLA system. TD-PSOLA (Time-Domain Pitch Synchronous OverLap and Add) was a time-domain system, while SMS was a frequency-domain system (TD-PSOLA was also pitch synchronous, hence the name, while SMS had a constant hop size). The first technique they developed was Narrow-Band Voice Pulse Modeling, and later Wide-Band Voice Pulse Modeling. Wide-Band Voice Pulse Modeling ended up being used in VOCALOID2.

Now that I understand this, I also understand the major mistake I made when reading the paper: I was reading it from the perspective of an implementer, thinking of the sections as the steps to implementing it instead of as research. I had thought that section 2.2 described the core processing algorithms, when it was actually about SMS, and specifically about the improvements they made to SMS rather than a complete description of it, since SMS was already an established technique. Hence my confusion about why some things were seemingly vaguely explained: the paper wasn’t about them. At the same time, much of that section is still very useful, because much of that research was also incorporated into the later techniques.

Results

I have successfully implemented the Wide-Band Voice Pulse Modeling, synthesis, and pitch transposition, time stretching, and timbre scaling algorithms. Additionally, I have finished implementing the full version of the pitch estimation module, changed the code to work using overlapping windows, implemented the window adaptation system, and fixed countless bugs.

I have been able to experimentally replicate a very important property, and in fact one of the main reasons WBVPM was developed: shape-invariance. An important property of the human voice is that, all else being equal, the shape of each pulse in the waveform stays roughly the same regardless of frequency. The reason for this property is phase-coherence. At the start of each voice pulse (when the glottis closes), the phases of all the harmonics within each formant (where each ‘formant’ is a spectral region affected by the vocal tract differently) are roughly the same. Since phase changes proportionally to frequency, the different harmonics will shift from that point over time, and soon become very different from one another. Since the phases differ widely in relation to frequency at times other than the voice pulse onset, the harmonics interfere constructively and destructively in the time domain. However, if the frequencies of all the harmonics are scaled equally, the phases all change at a slower or faster rate, but this rate scales the same for all of them. This gives rise to shape-invariance, since the pattern of interference stays the same, just at different scales.
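The scaling argument can be checked numerically with a toy harmonic signal. In this sketch the amplitudes and onset phases are made-up values, `harmonic_sum` is a hypothetical helper, and time zero is taken to be the pulse onset. If the onset phases are kept and every harmonic frequency is scaled by the same factor, the waveform is the same shape stretched in time.

```python
import math

def harmonic_sum(amps, onset_phases, f0, t):
    """Sum of harmonics of f0 whose phases are given at the pulse onset (t = 0)."""
    return sum(a * math.cos(2.0 * math.pi * (k + 1) * f0 * t + p)
               for k, (a, p) in enumerate(zip(amps, onset_phases)))

amps = [1.0, 0.5, 0.25]
onset_phases = [0.1, 0.1, 0.1]  # roughly flat phase at the onset

# One octave down: the value at time 2t equals the original value at time t.
for t in (0.0, 0.0003, 0.001, 0.0042):
    original = harmonic_sum(amps, onset_phases, 200.0, t)
    transposed = harmonic_sum(amps, onset_phases, 100.0, 2.0 * t)
    assert abs(original - transposed) < 1e-9
```

This only covers the ideal stationary case; real pulses also change in amplitude and timbre from one period to the next.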

If you apply a transform relative to a point that is not a voice pulse onset, the phases will not be flat. The transform can propagate the phase changes from the point it started from, but it does NOT account for the inherent phase shift that comes from not being at a voice pulse onset. If no transform is applied, there is no issue. But if one is, say a pitch transposition, then the phases the signal would have had if it had actually been produced at the target pitch will differ considerably from the ones the transform produces, since the latter are based on phases measured at a different pitch. This results in the breaking of shape-invariance, a noticeable ‘phasiness’ sound, and the transformed sound sounding un-human.

500 samples from the original signal
1000 from a one octave down pitch transposition using a naive approach (a fixed-window and hop-size approach using a 1024-point Hann window)

Notice that not only is the waveform unrecognizable compared to the original, it even varies considerably between individual voice pulses!

Now compare to 1000 samples from the WBVPM approach:

Notice how the waveform is almost identical, only scaled up two times in period, and it varies much less.

You may be wondering, couldn’t we just downsample or upsample the signal and play it back at the same sample rate to get the same result? The difference is that we have independent control over pitch and time. In the example, I pitched the voice down by a factor of two, but kept the time the same, and the result contains the same number of samples as the original. Additionally, in the analysis and then synthesis reconstruction, it is separating the signal into individual voice pulses. It isn’t just scaling them; it is generating new voice pulses in the frequency domain and inserting them at positions that were also generated.
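To make the pitch/time independence concrete, here is a small sketch of how synthesis onset positions could be generated for a given duration. It assumes a constant f0 for simplicity (a real implementation would follow an f0 trajectory), and `synthesis_onsets` and its parameters are hypothetical names.

```python
def synthesis_onsets(duration_samples, sample_rate, f0_hz, pitch_ratio):
    """Onset positions (in samples) covering a fixed duration at a transposed pitch."""
    period = sample_rate / (f0_hz * pitch_ratio)
    onsets = []
    t = 0.0
    while t < duration_samples:
        onsets.append(t)
        t += period
    return onsets
```

For a 100Hz voice at a 1kHz sample rate, 100 samples hold 10 pulses untransposed and 5 pulses one octave down (`pitch_ratio=0.5`). The duration stays the same; only the number and spacing of the generated pulses change.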

Amplitude envelope of the latter half of the original audio
Amplitude envelope of the latter half of the pitch-transposed audio

Notice how they are roughly the same. If the audio was just downsampled, the transposed envelope would be stretched out by a factor of two, but it is not.

I have also implemented timbre-scaling, although I have not tested it. Fun fact: when I implemented it, I actually did so by accident. I was trying to implement the pitch transposition, got a bit confused, and realized I had also accidentally implemented timbre scaling.

All these transforms are currently implemented as linear transforms. However, they are all implemented by just sampling a spline at a regular interval, so they could be trivially made to accept a non-linear parameter, sequence of points, or spline instead.

Although this current implementation is far from perfect, I think it works reasonably well as a demonstration of the techniques and their properties. Keep in mind that I have done nearly no adjustment of the constant parameters (‘tuning’). In fact, there are several parameters whose corresponding feature is effectively disabled because I wasn’t sure what value to pick. This implementation could probably be considerably improved just by adjusting a few constants. A (hopefully) efficient and accurate way of ‘tuning’ automatically is discussed later in this post.

Additionally, I have also tested reconstructing the sound with no transforms. This seems to yield little residual, although it seems concentrated at higher frequencies, so maybe that can be fixed. It could also be caused by aliasing.

Spectrogram of the original audio
Spectrogram of the reconstructed audio

And here’s the reconstructed sound: https://files.catbox.moe/bhxjpw.wav

This is still a simplified model. It does not take into account the Excitation plus Resonance model or the Spectral Voice Model. It uses a linear transform and not generated trajectories.

Pitch Estimation

Throughout this project, the most finicky part has been, and continues to be, the pitch estimation; specifically the Two-Way Mismatch algorithm for monophonic pitch detection. I have compiled several variations of the TWM algorithm. I tested one change that worked by scaling a term by the amplitude (I had actually thought of this idea myself, and this term happened to be the one I mentioned causing me trouble last time), and it led to considerable improvement, so I kept it.

There’s also the adaptive window procedure that wraps the TWM f0 estimation. One thing I had been noticing for a long time was that Kaiser-Bessel beta values about 10% higher than the recommended values given in Cano 1998 seemed to perform much better. I had assumed this was just because of issues with my code, or the audio samples I was testing on. Much later, I was experimenting in Python when I noticed a function called kaiser_beta, which converts something abbreviated to ‘a’ into the equivalent beta value. Previously, in Cano 1998 and in other places, I had seen the Kaiser-Bessel parameter written as alpha instead of beta. Up until this point, I had either not paid attention to this, or I had assumed that the two referred to the same thing. I did some research and found out that kaiser_beta converts between attenuation and the beta value for the Kaiser-Bessel window. Then I found that there is indeed an alpha form of the parameter and that it is not the same as beta. Confusingly, alpha is not the attenuation either, although both abbreviate to ‘a’. The Kaiser-Bessel beta can be determined by just multiplying the alpha value by pi. Interestingly, this is much higher than the 10% increase I tested, but it seemed to perform better (or at least not worse) anyway. A possible explanation for this discrepancy is that the adaptive window is larger than the window I used to test the adjustment originally.
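The two conversions that are easy to mix up can be written side by side. The first is the alpha-to-beta relation mentioned above. The second is Kaiser's empirical attenuation-to-beta formula, which, as far as I can tell, is the one scipy.signal.kaiser_beta implements; the function names here are made up for the example and nothing below calls SciPy.

```python
import math

def kaiser_beta_from_alpha(alpha):
    """Window-shape parameter: beta = pi * alpha."""
    return math.pi * alpha

def kaiser_beta_from_attenuation(a_db):
    """Kaiser's empirical formula from stop-band attenuation in dB to beta."""
    if a_db > 50.0:
        return 0.1102 * (a_db - 8.7)
    if a_db > 21.0:
        return 0.5842 * (a_db - 21.0) ** 0.4 + 0.07886 * (a_db - 21.0)
    return 0.0
```

For example, an alpha of 3 corresponds to a beta of about 9.42, while an attenuation of 65dB corresponds to a beta of about 6.20, so treating an alpha value as a beta (or as an attenuation) gives a very different window.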

Another improvement relating to windows is the window used for the harmonics that are fed into MFPA. Originally, I had used the same Kaiser-Bessel window for both. I later switched to a Blackman-Harris -92dB window, which I had seen mentioned in the paper. This resulted in a significant improvement. Another improvement I tried was adapting the window size to a value relative to the period of the estimated fundamental frequency. I tried doing this – using the same number of periods as are used for the Kaiser-Bessel window used for TWM – and noted a substantial improvement, even more so than the improvement from switching to the Blackman-Harris window in the first place. Indeed, this matches the results contained in the study. In the WBVPM section, they observe a considerable improvement (up to -10dB) when using an adaptive window size when compared to a fixed window for narrow-band analysis. In that same section, they also found 2 to be the ideal number of periods for minimizing noise and also did experiments with a Hann window. Perhaps experimenting with these ideas could lead to improvements, although that section is about getting an accurate spectrum and reconstruction, which may differ somewhat in needs from the needs of MFPA. Another idea could be using a separate function for determining its adaptive number of periods, as opposed to using the same value as for the Kaiser-Bessel window as I am currently doing. Perhaps always using an integer number of periods could be beneficial. Another idea is only using one or two periods, which would provide better time resolution, and could be better suited for wide-band analysis as we are doing. 
Another potential improvement I have thought of, but not yet tested, is modifying the constant parameters of the Blackman-Harris window in a manner similar to the method Cano 1998 describes for the Kaiser-Bessel window beta (and that I have used for that), where the constant parameters (or in this case, parameters) are modified in accordance with the fundamental frequency.
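A minimal sketch of a period-adaptive Blackman-Harris window follows. The four coefficients are the standard 4-term (-92dB) Blackman-Harris values; the `periods` parameter, the rounding rule, and the function names are assumptions for illustration, not the values or code used in the experiments above.

```python
import math

# Standard 4-term Blackman-Harris (-92dB) coefficients.
BH92 = (0.35875, 0.48829, 0.14128, 0.01168)

def blackman_harris(n):
    """Symmetric 4-term Blackman-Harris window of length n."""
    if n < 2:
        return [1.0] * n
    window = []
    for i in range(n):
        x = 2.0 * math.pi * i / (n - 1)
        window.append(BH92[0] - BH92[1] * math.cos(x)
                      + BH92[2] * math.cos(2.0 * x) - BH92[3] * math.cos(3.0 * x))
    return window

def adaptive_window(f0_hz, sample_rate, periods):
    """Window whose length is a fixed number of f0 periods, rounded to samples."""
    length = max(2, int(round(periods * sample_rate / f0_hz)))
    return blackman_harris(length)
```

At 44.1kHz with a 300Hz fundamental, four periods give a 588-sample window. Making the coefficients themselves depend on f0, as suggested above, would mean replacing the `BH92` constant with a function of `f0_hz`.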

Another potential source of improvement for the MFPA results could be the use of a peak selection algorithm. I had previously used a very simple one I had found in another resource by the UPF Music Technology Group. However, this algorithm did not seem to show an improvement; I later removed it and saw no observable detriment. The paper does not provide details on this specifically, but I now understand why, so I should do more research into how this was tackled in SMS. One idea I’ve thought of myself is to calculate the estimated harmonics and then search the surrounding area for peaks. We then select the peak with the minimum error, where that error is determined based on distance and amplitude. One formula I have thought of for the error calculation, but not tested, is amplitude / distance^2. We want to search far enough to always have the best candidate, but not so far as to be computationally inefficient or run into floating-point error and instability in the error function. A potential improvement to this approach: instead of determining the initial estimate for the harmonic frequency by multiplying the fundamental frequency by the harmonic index, we could add the fundamental frequency to the peak that was chosen for the previous harmonic. This would account for drift caused by inaccuracies in the f0 estimation and also distortion in the harmonics. However, this also runs the risk of drifting away from the harmonics. A possible solution to this issue could be blending this estimated harmonic frequency with the one obtained by multiplication with the fundamental frequency. This could act as a sort of course correction that would work gradually, while keeping the benefits of basing the estimate on the previously selected harmonic peak.
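The peak selection idea, including the blended prediction, could look roughly like the sketch below. Note one interpretation choice: amplitude / distance^2 grows for strong, nearby peaks, so the sketch treats it as a score to maximize rather than an error to minimize. The `radius` and `blend` parameters, the small constant guarding against division by zero, and the function name are all hypothetical.

```python
def select_harmonic_peaks(peaks, f0, n_harmonics, radius, blend=0.5):
    """Pick one spectral peak per harmonic.

    peaks: list of (frequency, amplitude) pairs.
    blend: weight of h * f0 versus (previous selected peak + f0) in the prediction.
    Returns a list of selected frequencies, with None where nothing was in range.
    """
    selected = []
    previous = None
    for h in range(1, n_harmonics + 1):
        predicted = h * f0
        if previous is not None:
            predicted = blend * predicted + (1.0 - blend) * (previous + f0)
        best, best_score = None, 0.0
        for freq, amp in peaks:
            distance = abs(freq - predicted)
            if distance > radius:
                continue
            score = amp / (distance * distance + 1e-9)
            if best is None or score > best_score:
                best, best_score = freq, score
        selected.append(best)
        # Fall back to the prediction so the chain can continue past a gap.
        previous = best if best is not None else predicted
    return selected
```

With `blend=1.0` this reduces to searching around exact multiples of f0, and with `blend=0.0` each prediction depends only on the previously selected peak, which is the variant that risks drifting.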

Another potential improvement could be found by fixing sudden jumps in fundamental frequency that last for only a few analysis frames and then return to roughly the same fundamental frequency as before the jump. Cano 1998 calls for a “hystheresis cycle”, though I am not sure exactly what that means. I have implemented a simple system that discards large relative jumps that last for only a single frame. However, this has two major issues. The first is that these jumps often last for more than just one frame. The second is that if a legitimate jump in f0 occurs and stays, this introduces one frame of lag.
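One way to extend the single-frame rule to jumps lasting several frames is sketched below. It is not the hysteresis scheme from Cano 1998; the `ratio` threshold, the `max_len` limit, and the choice to bridge the jump with linear interpolation are all assumptions. It also still needs to look `max_len` frames ahead, so the lag problem for legitimate jumps remains (and grows with `max_len`).

```python
def is_jump(a, b, ratio):
    """True if two positive f0 values differ by more than the given ratio."""
    return max(a, b) / min(a, b) > ratio

def remove_short_jumps(f0, max_len=3, ratio=1.3):
    """Replace jumps of at most max_len frames that return near the old value."""
    out = list(f0)
    i = 1
    while i < len(out):
        prev = out[i - 1]
        if is_jump(prev, out[i], ratio):
            # Look ahead for a frame that is back near the pre-jump value.
            for j in range(i + 1, min(i + max_len, len(out) - 1) + 1):
                if not is_jump(prev, out[j], ratio):
                    span = j - (i - 1)
                    for k in range(i, j):
                        w = (k - (i - 1)) / span
                        out[k] = (1.0 - w) * prev + w * out[j]
                    i = j
                    break
        i += 1
    return out
```

A jump that does not return within `max_len` frames is left untouched, so a genuine change in f0 passes through unchanged.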

Maximally Flat Phase Alignment

Since my first successful tests with MFPA, I have made a number of improvements to this part of the system. I don’t believe I have made any changes to the core MFPA function itself, but I have made a lot of improvements to the MFPA refinement algorithm as well as the code surrounding MFPA.

I made one major improvement only recently. The previous issue stemmed from what I now believe to have been a misunderstanding. The MFPA algorithm gives a phase shift for each frame. This can be converted into a time offset. However, unless the frame rate is exactly the same as the fundamental frequency (in the instantaneous sense), this will give more or fewer pulse onsets than actually exist. At the time, I was using a high-pitched sample for testing whose f0 was much higher than the frame rate given by the analysis hop size of 256 samples (~172 frames per second at 44.1kHz). Because of this, there was usually more than one pulse in between each pair of detected pulse onsets. At the time, I had thought that getting all the pulse onsets was the purpose of the MFPA refinement algorithm, which is why I was confused that it was described as choosing a subset of the pulses and not a superset. I had implemented the MFPA refinement algorithm, but it was buggy and either didn’t work or did nothing. Later, I began thinking of ways of getting the in-between onsets myself. My idea was to add increments of the f0 period until the next pulse was reached.

I eventually realized that the purpose of the MFPA refinement algorithm was not interpolation, but to take a list of pulse onsets that could include multiple close estimates for the same pulse and narrow it down so that there is only one per pulse and the best one is chosen (it actually looks at a few additional candidates, which for a long time tripped me up into thinking it was about interpolation). For this to happen, the analysis frame rate needs to be greater than the fundamental frequency, i.e. the hop size needs to be shorter than the fundamental period (if they were equal, the frames would likely slowly drift and eventually miss an onset). I realized that the hop size was large (and thus the maximum frequency low) in the paper because they were using low-frequency audio samples in the range of 50-100Hz, while I was using samples around 300Hz. I adjusted the hop size to 96 and got great results. I think I had also tried this before, but it had not worked, and it couldn’t have, because this is only possible without decreasing the size of the analysis window within the overlapping window framework, which I had not implemented yet at the time I first tried.
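The hop-size arithmetic and the "one onset per pulse" idea can be sketched as follows. The refinement here is a deliberately simplified stand-in, not the algorithm from the paper: it assumes each candidate onset carries some quality score, uses a single constant period, and merges candidates that fall within half a period of each other. All names are hypothetical.

```python
import math

def frame_rate(sample_rate, hop):
    """Analysis frames per second for a given hop size in samples."""
    return sample_rate / hop

def max_hop(sample_rate, f0_max):
    """Largest integer hop whose frame rate is strictly above f0_max."""
    return math.ceil(sample_rate / f0_max) - 1

def refine_onsets(candidates, period):
    """Keep the best-scoring candidate among estimates closer than half a period.

    candidates: list of (time_in_samples, score) pairs, higher score is better.
    """
    refined = []
    for time, score in sorted(candidates):
        if refined and time - refined[-1][0] < 0.5 * period:
            if score > refined[-1][1]:
                refined[-1] = (time, score)
        else:
            refined.append((time, score))
    return refined
```

At 44.1kHz, a hop of 256 gives about 172 frames per second, too few for a 300Hz voice, where the hop has to be at most 146 samples; a hop of 96 gives about 459 frames per second, leaving some margin.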

However, this low hop size is relatively computationally expensive, so much so that f0 and MFPA peaks take up most of the execution time. A possible improvement would be to use a lower analysis rate and actually use the interpolation method, but then feed the interpolated pulses into the MFPA refinement algorithm as you would likely get better results that way.

I have fixed numerous bugs within the MFPA refinement implementation. A noteworthy one is that previously, I was not considering that the analysis window’s time is in the center, and not the start. Because of that, the new onsets are now offset compared to the old ones, but I believe it is now correct.

The pulse onset selection is now quite good
A close-up of the same area in the waveform

Compare to a screenshot of my first successful test from about a month ago:

You can clearly see the result of the large analysis hop. Note that this doesn’t show a major issue I had back then, which was a gradual drift over time in the relative position in the waveform. This shows the start of the audio, while the new figures show it from around the middle.

However, there are still deviations even with the current code. Here is one at around 20k samples in one of my test audio samples:

So there is still some work to be done with respect to this part of the system, although I imagine most of the improvement would stem from better f0 estimation and harmonic selection.

Another potential improvement could be the introduction of a system for detecting formants and weighting them less in the MFPA calculation. Recall that phase is roughly constant within a formant, but not between them.

Starting Frames

In the audio samples I have provided so far, I have cut off the first part of the audio. The issue is with pitch estimation for early frames. Remember that our analysis window is multiple f0 periods in size. Because of this, it can’t fit at the start, so it has to be decreased to a much smaller size. This is much more of an issue now that I have decreased the hop size substantially. I have now set it to skip the first few frames, because otherwise the forced extremely small window size causes the whole pitch estimation system to irreversibly destabilize. I’ve been thinking of solutions to this problem. One solution could be to let the analysis window take on the full size it wants and pad the area before the start with zeros or maybe something else; this could also be used for the end. Possibly the most promising solution I have come up with, although I have not tested any of these, is to backfill the earlier pulses with the first good estimated pulse onset minus integer multiples of the period of the first good fundamental frequency estimate. This should work assuming both the first pulse and fundamental frequency estimate are good, the fundamental frequency stays relatively constant over the start section, and the start section only contains a few pulses. Luckily, the last criterion will always be satisfied: the size of the start section is half the size of the window, so the number of pulses is (window_size / period) / 2, and since the window size in the adaptive framework is just a small number of periods, we are left with the (mostly) constant adaptive_period_count / 2 as the number of pulses.
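The backfill idea is simple enough to sketch directly. As stated above it has not been tested on real audio; the function name is hypothetical, and the sketch assumes a constant period over the start section and positions measured in samples from the start of the file.

```python
def backfill_onsets(first_onset, period):
    """Onsets before the first good one, spaced by one period, in ascending order."""
    onsets = []
    t = first_onset - period
    while t >= 0:
        onsets.append(t)
        t -= period
    onsets.reverse()
    return onsets
```

For a first good onset at sample 500 and a period of 147 samples, this yields onsets at samples 59, 206 and 353.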

Regarding the patent issue, I have determined that it applies only to the specific technique in Bonada 2008 WBVPM of using periodization to achieve a real-sized discrete Fourier transform. However, that section also describes another option, that being interpolation. I have implemented it and found it to work well. I did a test and found a noise level of about -140dB (for reference, 1ulp for a single-precision float is about -145dB), which is negligible and comparable to the results in the study for the periodization technique. I have also added the ability to use a few extra samples on the side to improve the spline. However, I have not tested the consequences of this variation. I don’t know whether the original implementation did something like this.
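The interpolation option can be illustrated with a toy example: one period of non-integer length is resampled to an integer number of points, after which DFT bin k lines up with harmonic k. This sketch uses linear interpolation and a plain O(n^2) DFT purely to stay dependency-free, so its accuracy is far worse than the spline-based result quoted above; the function names and the 16-point size are assumptions.

```python
import cmath
import math

def resample_period(signal, start, period, n_out):
    """Linearly interpolate n_out points spanning exactly one (non-integer) period."""
    out = []
    for i in range(n_out):
        pos = start + i * period / n_out
        j = int(math.floor(pos))
        frac = pos - j
        out.append((1.0 - frac) * signal[j] + frac * signal[j + 1])
    return out

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# A pure tone with a period of 10.5 samples, which no integer-length window matches.
period = 10.5
signal = [math.cos(2.0 * math.pi * n / period) for n in range(32)]
frame = resample_period(signal, 3.0, period, 16)
magnitudes = [abs(v) * 2.0 / 16 for v in dft(frame)]
# Nearly all energy lands in bin 1 (the first harmonic); bins 0 and 2 stay small.
```

Swapping the linear interpolation for a spline, optionally with a few extra samples on each side as mentioned above, is what brings the leakage down to a negligible level.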

The wording in the patent:

generating for each pulse a sequence of repetitions of said audio pulse, said audio pulse being repeated according to its own characteristic frequency; deriving frequency domain information associated with at least some of the sequences of repetitions of said audio pulses, each said sequences of repetitions of said audio pulse being represented as a vector of sinusoids based on the derived frequency, said vector of sinusoids corresponds to a sinusoidal series expansion of the specific audio pulse

Compare to this section in the paper:

PERIODIZATION: one period of the input signal is windowed with $w_R(n)$, and repeated several times at the rate defined by $T$ so that the FFT buffer of length $M$ covers in the end several periods. The repetition implies interpolating both the signal samples and the window function. Then the resulting signal $s_r(n)$ is windowed by an analysis window function $w_A(n)$, and the spectrum obtained is actually the convolution of such analysis window response $W_A(f)$ by the spectrum of $S_r(f)$ sampled at harmonic frequencies, i.e.

$$X_r(f) = \sum_k W_A(f - f_k)\, S_r(f_k) \qquad (2.90)$$

where actually $S_r(f)$ is the STFT of length $T$. In general, the frequencies of the spectral bins don’t correspond to the harmonic frequencies but to

$$X_r(b) = \sum_{n=0}^{M-1} x_r(n)\, e^{-j 2\pi \frac{b}{M} n} = X_r\!\left(\frac{b f_s}{M}\right).$$

Bonada 2008, WBVPM, NON-INTEGER SIZE FFT

One thing I was thinking about was the part where they said that one of the disadvantages of WBVPM was not being able to separate harmonic and non-harmonic. I also read that the noise is embedded as fluctuations in the spectrum of each voice pulse and over time, which is what I had presumed because the information has to go somewhere.

I was thinking, what if you took each harmonic as the values and the pulse onsets times as the positions in a spline. Then interpolated at regular intervals. Then applied the fourier transform. Then separate the highest frequencies and the others. Take the others and apply the inverse Fourier transform, and then rebuild a spline from this and interpolate the values back at the onsets. I wonder if this would work.

There would be loss though because of the resampling steps. This could be decreased by taking more samples. You could also apply a correction by resampling and then sampling back, without the removal of the high frequency modulations, to calculate the resampling loss itself, and then add this difference back to the main pulse information after the separation.

Tuning

I have come up with two techniques for tuning that apply in different ways.

AUTOMATIC TUNING – The idea is that we use a stochastic statistical algorithm that minimizes a cost function by adjusting a set of parameters (one I looked into that seems promising is global-optimization SPSA). The parameters in this case would be the constants used in the C code. A Python script would replace placeholders with the values picked by the minimization algorithm and then compile and run the C program. The results would then be compared to a reference by another algorithm/program, and the comparison scores would be summed together to give a cost value. One program for doing this comparison that I plan to research is called AudioVMAF. I believe it was originally designed to test audio compression, however I hope that it could also be useful here.
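A bare-bones version of the optimizer side could look like the sketch below. It is plain SPSA with the commonly used gain exponents (0.602 and 0.101), not the global-optimization variant mentioned above, and the `cost` callable is a placeholder: in the real setup it would substitute the parameters into the C source, compile and run it, and score the output against a reference. The default gains and the function name are assumptions.

```python
import random

def spsa_minimize(cost, theta, iterations=2000, a=0.2, c=0.1, big_a=10.0, seed=0):
    """Minimize cost(theta) using two cost evaluations per iteration."""
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(iterations):
        a_k = a / (k + 1 + big_a) ** 0.602  # step size, decays over time
        c_k = c / (k + 1) ** 0.101          # perturbation size, decays slowly
        # Perturb every parameter at once by a random +/-1 direction.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        diff = cost(plus) - cost(minus)
        # Simultaneous-perturbation gradient estimate, one entry per parameter.
        theta = [t - a_k * diff / (2.0 * c_k * d) for t, d in zip(theta, delta)]
    return theta
```

The appeal for this use case is that each iteration needs only two runs of the program regardless of how many constants are being tuned. The gains `a` and `c` would need adjusting to the scale of the real cost function.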

SEMI-AUTOMATIC TUNING – In this method, we insert instrumentation for various intermediate values calculated in the program. Then, for one very small snippet of audio, we use Automatic Tuning to determine ideal values. Then, a programmer tries to write code to make the program better match these desired values. If successful, the change can be tested in general over the whole dataset. If it is not an improvement, then the most negatively affected audio snippets can be selected and put through a similar process to decrease the change for them while keeping the change for the ones that benefit.

Both of these methods would work best for matching with another vocal synthesizer, since the timings and parameters can match exactly. However, they may also be adaptable to optimizing parameters for real-world (and thus also realistic) voices. It would have to work somewhat in reverse though in that someone would sing first and then a note sequence would have to be made that matches it almost exactly.

Other Considerations

There are many more potential tweaks and improvements. I have many dozens accumulated and plenty more to research, test, and implement. One widely applicable variation is using logarithmic based scales.

I still don’t have an answer to the voiced/unvoiced frame decision issue, but I will look for SMS research about that and older Bonada papers. One heuristic I thought of is noise / amplitude^2 > threshold.

queuesevenm
http://queuesevenm.wordpress.com/?p=8
Extensions