Hello, I'm back and I have a major update to my VOCALOID project! I have successfully achieved a shape-invariant pitch transposition!
Here it is.
First the original audio: files.catbox.moe/zmt3rr.wav
Now my version with WBVPM (pitched down by an octave): voca.ro/1mJ5qljrp9hD or files.catbox.moe/kho97n.wav
And a version using a naive pitch shift: files.catbox.moe/xs39bq.wav
Notice that my version, while having more noise, sounds more natural and has less phasiness. This is particularly noticeable if you play both at very low volume. One sounds much more 'human' than the other.
Also note that this is an extreme example with an octave shift (or 1200 cents) - in practice, shifts would typically be far smaller. Also, this doesn't implement several other parts of the system (more on that later).
I'll explain all of this in a moment, but first, I'd like to correct some major biographical errors. Since this is a long post, I've divided it into sections.
BIOGRAPHICAL CORRECTIONS
In the last post, I claimed that VOCALOID1 used Narrow-Band Voice Pulse Modeling while VOCALOID2 and onwards used Wide-Band Voice Pulse Modeling. This was incorrect, and additionally it was the source of most of my confusion surrounding the paper.
What actually happened is that the research technology that would later become VOCALOID1 started out as work to improve the existing Spectral Modeling Synthesis (SMS) system that had been developed in the early 1990s. Work on this improvement began in the late 1990s. But importantly, this system evolved, and techniques from it were combined with techniques from a system then in development called a Phase-Locked Vocoder (PLVC), and the result would be released as VOCALOID1. In the mid-2000s, work began on taking the techniques learned from improving SMS and from the PLVC-based system and combining them with the much older and well-known TD-PSOLA system. Importantly, TD-PSOLA (Time-Domain Pitch Synchronous OverLap and Add) was a time-domain system, while SMS was a frequency-domain system (and TD-PSOLA was also pitch synchronous, hence the name, while SMS had a constant hop size). The first technique they developed was Narrow-Band Voice Pulse Modeling, and later Wide-Band Voice Pulse Modeling. Wide-Band Voice Pulse Modeling ended up being used in VOCALOID2.
Now that I understand this, I also understand the major mistake I made when reading the paper: I was reading it from the perspective of an implementer, thinking of the sections as the steps to implementing it instead of as research. I had thought that section 2.2 described the core processing algorithms, when it was actually about SMS, and importantly, about *the improvements they made to SMS*, not a complete description of SMS, since SMS was already an established technique. Hence my confusion about why some things were seemingly vaguely explained: *the paper wasn't about them*. At the same time, much of that section is still very useful, because much of that research was also incorporated into the later techniques.
RESULTS
I have successfully implemented the Wide-Band Voice Pulse Modeling; synthesis; and pitch transposition, time stretching, and timbre scaling algorithms. Additionally, I have also finished implementing the full version of the pitch estimation module, changed the code to work using overlapping windows, implemented the window adaptation system, and fixed countless bugs.
Importantly, I have been able to experimentally replicate a very important property - and one of the main reasons WBVPM was developed, in fact. That property is shape-invariance. You see, an important property of the human voice is that, all else being equal, the shape of each pulse in the waveform stays roughly the same regardless of frequency. The reason for this property is phase-coherence. At the start of each voice pulse (when the glottis closes), the phases of all the harmonics within each formant (where each 'formant' is a spectral region affected by the vocal tract differently) are roughly the same. Since phase changes proportionally to frequency, the different harmonics will shift from that point over time, and soon become very different from one another. Since the phases are vastly different with relation to the frequency at times other than the voice pulse onset, the harmonics interfere constructively and destructively in the time-domain. Importantly however, if all the harmonics are scaled equally, the phases all now change at a slower or faster rate, but importantly this rate scales the same for all of them. This gives rise to shape-invariance, since the pattern of interference stays the same, just at different scales.
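To make the scaling argument concrete, here is a minimal sketch (not the WBVPM code itself). It assumes an idealized pulse: a sum of harmonics with made-up 1/k amplitudes whose phases are all zero at the onset. The function name and parameters are only for illustration. When every harmonic is scaled by the same factor, one period sampled at the same relative positions comes out identical.

```python
import math

def pulse_shape(f0, n_harmonics=10, n_points=64):
    """Sample one period of a sum of harmonics whose phases are all zero
    at t = 0 (an idealized stand-in for the voice pulse onset)."""
    period = 1.0 / f0
    shape = []
    for i in range(n_points):
        t = period * i / n_points  # same relative position within the period
        shape.append(sum(math.cos(2.0 * math.pi * k * f0 * t) / k
                         for k in range(1, n_harmonics + 1)))
    return shape

original = pulse_shape(220.0)      # hypothetical source pitch
octave_down = pulse_shape(110.0)   # every harmonic scaled by the same factor

# Largest sample-wise difference between the two period shapes.
max_diff = max(abs(a - b) for a, b in zip(original, octave_down))
print(max_diff)
```

The difference is only floating-point rounding, because each term depends on k * f0 * t, and t is a fixed fraction of the period.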
Importantly, if you apply a transform relative to a point that is not a voice pulse onset, the phases will not be flat. Of course, that transform can shift the phases *from* the point it started from, but importantly it is NOT accounting for the inherent phase shift that comes from not being at a voice pulse onset. Of course, if no transform is applied, there will be no issue. But if one is, say a pitch transposition, then the phases the signal would have had at that point if it had originally been produced at the new pitch will differ considerably from the observed ones, since the observed ones are based on the phase measured *at a different pitch*. This breaks shape-invariance, produces a noticeable 'phasiness', and makes the result sound un-human.
Here is an image of 500 samples from the original signal: files.catbox.moe/223l7p.png
Now here are 1000 samples from a one octave down pitch transposition using a naive approach (a fixed-window and hop-size approach using a 1024-point Hann window): files.catbox.moe/jxgtg0.png
Notice that not only is the waveform unrecognizable compared to the original, it even varies considerably between individual voice pulses!
Now compare to 1000 samples from the WBVPM approach: files.catbox.moe/6hpf8l.png
Notice how the waveform is almost identical, only scaled up two times in period, and it varies much less.
You may be wondering, couldn't we just downsample or upsample the signal and play it back at the same sample rate to get the same result? Well, importantly, we have independent control over pitch and time. In the example, I downsampled the voice by a factor of two, but kept the duration the same, and it contains the same number of samples as the original. Additionally, in the analysis and subsequent synthesis, it is separating the signal into individual voice pulses. Importantly, it isn't just scaling them, it is generating new voice pulses in the frequency domain and inserting them at positions that were also generated.
Here is an amplitude envelope of the latter half of the original audio: files.catbox.moe/6zw5v5.png
Now here is an amplitude envelope of the latter half of the pitch-transposed audio: files.catbox.moe/2a3utu.png
Notice how they are roughly the same. If the audio was just downsampled, the second would be stretched out by a factor of two - but it is not.
I also implemented timbre scaling, although I have not tested it. Fun fact: when I implemented it, I actually did so by accident. I was trying to implement the pitch transposition, got a bit confused, and realized I had also accidentally implemented timbre scaling.
All these transforms are currently implemented as linear transforms. However, they are all implemented by just sampling a spline at a regular interval, so they could be trivially made to accept a non-linear parameter, sequence of points, or spline instead.
Although this current implementation is far from perfect, I think it works reasonably well as a demonstration of the techniques and their properties. Keep in mind that I have done nearly no adjustment of the constant parameters/'tuning'. In fact, there are several parameters whose corresponding feature is effectively disabled because I wasn't sure what value to pick. This implementation could probably be considerably improved just by adjusting a few constants. A (hopefully) efficient and accurate way of 'tuning' automatically is discussed later in this post.
Notably, the pitch-transposed spectrum varies significantly, with some areas showing little residual and straight lines: files.catbox.moe/04naz0.png (those lines in some areas around the center of the spectrum are probably the aliasing artifacts Bonada 2008 mentioned in WBVPM when using upscaling, I will implement the method for avoiding them at some point)
Additionally, I have also tested reconstructing the sound with no transforms. This seems to yield little residual, although it seems concentrated at higher frequencies, so maybe that can be fixed. Maybe it could also be caused by aliasing. Here is an audio file that was reconstructed through the WBVPM synthesis procedure using downsampling: files.catbox.moe/bhxjpw.wav
And a spectrogram: files.catbox.moe/kvrnkq.png
Compare to the spectrogram of the original: files.catbox.moe/d1vo16.png
PITCH ESTIMATION
Throughout this project, the most finicky part has been - and continues to be - the pitch estimation; specifically the Two-Way Mismatch algorithm for monophonic pitch detection. I have compiled several variations of the TWM algorithm. I tested one change that worked by scaling a term by the amplitude (I had actually thought of this idea myself, and this term happened to be the term I mentioned causing me trouble last time), and it led to considerable improvement so I kept it.
There's also the adaptive window procedure that wraps the TWM f0 estimation. One thing I had been noticing for a long time was that Kaiser-Bessel beta values about 10% higher than the recommended values given in Cano 1998 seemed to perform much better. I had assumed this was just because of issues with my code, or the audio samples I was testing on. Much later, I was experimenting in Python when I noticed a function called kaiser_beta which converted something else abbreviated to 'a' to the equivalent beta value. Previously in Cano 1998 and in other places, I had seen the Kaiser-Bessel parameter given as alpha instead of beta. Up until this point, I had either not paid attention to this, or I had assumed that these referred to the same thing. I did some research and found out that the function converts between attenuation and the beta value for the Kaiser-Bessel window. Then I found that there is indeed an alpha form of the parameter and it is not the same as beta. Confusingly however, it is not attenuation either, even though both abbreviate to the same thing. The Kaiser-Bessel beta can be determined by just multiplying the alpha value by pi. Interestingly, this is much higher than the 10% increase I tested, however it seemed to perform better (or at least not worse) anyway. A possible explanation for this discrepancy is that the adaptive window is larger than the window I used to test the adjustment originally.
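To keep the two meanings of 'a' apart, here is a small sketch of both conversions. The alpha form is the pi multiplication described above. The attenuation form uses Kaiser's standard empirical formula, which as far as I know is what kaiser_beta-style helpers compute. The function names and the example value 2.0 are made up for illustration and are not taken from Cano 1998.

```python
import math

def beta_from_alpha(alpha):
    """Kaiser-Bessel 'alpha' form to 'beta' form: beta = pi * alpha."""
    return math.pi * alpha

def beta_from_attenuation(atten_db):
    """Kaiser's empirical formula: beta for a desired attenuation in dB."""
    if atten_db > 50.0:
        return 0.1102 * (atten_db - 8.7)
    if atten_db >= 21.0:
        return 0.5842 * (atten_db - 21.0) ** 0.4 + 0.07886 * (atten_db - 21.0)
    return 0.0

a = 2.0  # hypothetical value, read once as alpha and once as attenuation
print(beta_from_alpha(a))        # read as the alpha form of the parameter
print(beta_from_attenuation(a))  # read as attenuation in dB: a very different beta
```

The same number gives completely different windows depending on which 'a' it is taken to be, which is why it is worth checking which form a paper or library expects.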
Another improvement relating to windows is the window used for the harmonics that are fed into MFPA. Originally, I had used the same Kaiser-Bessel window for both. I later switched to a Blackman-Harris -92dB window, which I had seen mentioned in the paper. This resulted in a significant improvement. Another improvement I tried was adapting the window size to a value relative to the period of the estimated fundamental frequency. I tried doing this - using the same number of periods as are used for the Kaiser-Bessel window used for TWM - and noted a substantial improvement, even more so than the improvement from switching to the Blackman-Harris window in the first place. Indeed, this matches the results contained in the study. In the WBVPM section, they observe a considerable improvement (up to -10dB) when using an adaptive window size when compared to a fixed window for narrow-band analysis. In that same section, they also found 2 to be the ideal number of periods for minimizing noise and also did experiments with a Hann window. Perhaps experimenting with these ideas could lead to improvements, although that section is about getting an accurate spectrum and reconstruction, which may differ somewhat in needs from the needs of MFPA. Another idea could be using a separate function for determining its adaptive number of periods, as opposed to using the same value as for the Kaiser-Bessel window as I am currently doing. Perhaps always using an integer number of periods could be beneficial. Another idea is only using one or two periods, which would provide better time resolution, and could be better suited for wide-band analysis as we are doing. 
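As a rough illustration of the period-relative window size, here is a sketch that sizes a 4-term Blackman-Harris (-92dB) window to a number of f0 periods. The coefficients are the standard published ones for that window. The function names, the choice of 4 periods, and the 300Hz example are assumptions for illustration, not values from the paper or from my implementation.

```python
import math

# Standard 4-term Blackman-Harris coefficients (-92 dB side lobes).
BH92_COEFFS = (0.35875, 0.48829, 0.14128, 0.01168)

def adaptive_window_length(f0_hz, sample_rate, n_periods):
    """Window length in samples covering n_periods of the estimated f0."""
    return max(4, round(n_periods * sample_rate / f0_hz))

def blackman_harris_92(length):
    """Symmetric 4-term Blackman-Harris window of the given length."""
    a0, a1, a2, a3 = BH92_COEFFS
    window = []
    for n in range(length):
        x = 2.0 * math.pi * n / (length - 1)
        window.append(a0 - a1 * math.cos(x)
                      + a2 * math.cos(2.0 * x)
                      - a3 * math.cos(3.0 * x))
    return window

# Hypothetical example: 4 periods of a 300 Hz voice at 44.1 kHz.
length = adaptive_window_length(300.0, 44100, n_periods=4)
window = blackman_harris_92(length)
print(length, max(window))
```

Rounding to an integer number of samples means the window only approximately covers a whole number of periods, which matters if always using an exact integer number of periods turns out to be beneficial.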
Another potential improvement I have thought of, but not yet tested, is modifying the constant parameters of the Blackman-Harris window in a manner similar to the method Cano 1998 describes for the Kaiser-Bessel window beta (and that I have used for that), where the constant parameter (or in this case, parameters) is modified in accordance with the fundamental frequency.
Another potential improvement to the MFPA results could be the use of a peak selection algorithm. I had previously used a very simple one I had found in another resource by the UPF Music Technology Group, although this algorithm did not seem to show an improvement. I later removed it and saw no observable detriment. The paper does not provide details on this specifically, but I now understand why, so I should do more research into how this was tackled in SMS. One idea I've thought of myself is to calculate the estimated harmonics and then search the surrounding area for peaks. We then select the peak with the minimum error, where that error is determined based on distance and amplitude. One formula I have thought of for the error calculation, but not tested, is amplitude / distance^2. We want to search far enough to always have the best candidate, but not so far as to be computationally inefficient or run into floating-point error and instability in the error function. A potential improvement to this approach is that, instead of determining the initial estimate for the harmonic frequency by multiplying the fundamental frequency by the harmonic index, we could add the fundamental frequency to the peak that was chosen as the previous harmonic. This would account for drift caused by inaccuracies in the f0 estimation and also distortion in the harmonics. However, this also runs the risk of drifting away from the harmonics. A possible solution to this issue could be blending this estimated harmonic frequency with the one obtained by multiplication with the fundamental frequency. This could act as a sort of course correction that would work gradually, but at the same time keep the benefits of basing it on the previously selected harmonic peak.
Another potential improvement could be found by fixing sudden jumps in fundamental frequency that last for only a few analysis frames and then return to roughly the same fundamental frequency as before the jump. Cano 1998 calls for a "hystheresis cycle" - though I am not sure exactly what that means. I have implemented a simple system that discards large relative jumps that last for only a single frame. However, this has two major issues. The first is that these jumps often last for more than just one frame. The second is that if a legitimate, lasting jump in f0 occurs, this introduces one frame of lag.
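Here is an untested-in-practice sketch of the peak selection idea, including the blending. Since amplitude / distance^2 grows for louder, closer peaks, the sketch treats it as a score to maximize rather than an error to minimize, and adds a small epsilon so a distance of zero does not blow up. The function name, the search width of half an f0, the 50/50 blend, and the peak list are all made-up assumptions.

```python
def select_harmonic_peaks(peaks, f0, n_harmonics,
                          search_width=0.5, blend=0.5, eps=1e-6):
    """Pick at most one spectral peak per harmonic.

    peaks: list of (frequency_hz, linear_amplitude) candidates.
    Returns a list with one (frequency_hz, amplitude) or None per harmonic.
    """
    chosen = []
    prev_freq = None
    for k in range(1, n_harmonics + 1):
        nominal = k * f0  # estimate from multiplication alone
        if prev_freq is None:
            expected = nominal
        else:
            # Blend "previous chosen peak + f0" with the nominal estimate.
            expected = blend * (prev_freq + f0) + (1.0 - blend) * nominal
        best, best_score = None, 0.0
        for freq, amp in peaks:
            dist = abs(freq - expected)
            if dist > search_width * f0:
                continue  # outside the search range
            score = amp / (dist * dist + eps)
            if score > best_score:
                best, best_score = (freq, amp), score
        chosen.append(best)
        # With no peak found, fall back to the nominal estimate next time.
        prev_freq = best[0] if best is not None else None
    return chosen

# Hypothetical peaks for f0 = 100 Hz with slightly stretched harmonics.
example_peaks = [(101.0, 1.0), (150.0, 5.0), (203.0, 0.8),
                 (230.0, 0.9), (305.0, 0.5)]
selected = select_harmonic_peaks(example_peaks, 100.0, 3)
print(selected)
```

In this example the loud peak at 150Hz loses to the quieter peak at 101Hz because of the squared distance, which is the behaviour the formula is meant to give.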
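For reference, a single-frame jump rejection of this kind might look like the sketch below. This is not the hysteresis scheme from Cano 1998, and the function names and the 1.2 ratio threshold are made up for illustration. It shows both limitations: it only catches jumps that last exactly one frame, and it needs the next frame before it can decide, hence the one frame of lag.

```python
def differs(a, b, max_ratio):
    """True if two positive f0 values differ by more than max_ratio."""
    return max(a, b) / min(a, b) > max_ratio

def remove_single_frame_jumps(f0_track, max_ratio=1.2):
    """Replace a frame with its predecessor when it jumps away from the
    previous frame while the following frame agrees with the previous one."""
    out = list(f0_track)
    for i in range(1, len(out) - 1):
        prev, cur, nxt = out[i - 1], out[i], f0_track[i + 1]
        if differs(prev, cur, max_ratio) and not differs(prev, nxt, max_ratio):
            out[i] = prev
    return out

glitchy = [220.0, 221.0, 440.0, 222.0, 223.0]    # one-frame octave error
real_jump = [220.0, 220.0, 330.0, 330.0, 330.0]  # legitimate jump that persists
print(remove_single_frame_jumps(glitchy))
print(remove_single_frame_jumps(real_jump))
```

The one-frame glitch is replaced and the persistent jump is left alone. A jump lasting two or more frames would pass through untouched, which is the first issue mentioned above.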
MAXIMALLY FLAT PHASE ALIGNMENT
My last post was about MFPA. Since then, I have made a number of improvements to this part of the system. I don't believe I have made any changes to the core MFPA function itself, but I have made a lot of improvements to the MFPA refinement algorithm as well as the code surrounding MFPA.
One major improvement I made only recently. The previous issue stemmed from what I now believe to have been a misunderstanding. The MFPA algorithm gives a phase shift for each frame. This can be converted into a time offset. However, unless the frame rate is exactly the same as the fundamental frequency (in the instantaneous sense), this will give more or fewer pulse onsets than actually exist. At the time, I was using a high-pitched sample for testing whose f0 was much higher than the frame rate given by the analysis hop size of 256 samples (~172 frames per second at 44.1kHz). Because of this, there was usually more than one pulse in between each detected pulse onset. At the time, I had thought that getting all the pulse onsets was the purpose of the MFPA refinement algorithm, which is why I was confused that it was described as choosing a *subset* of the pulses and not a superset. At the time, I had implemented the MFPA refinement algorithm, but it was buggy and either didn't work or did nothing. Later, I began thinking of ways of getting the in-between onsets myself. My idea was to add increments of the f0 period until the next pulse was reached.
I eventually realized that the purpose of the MFPA refinement algorithm was not interpolation, but to take a list of pulse onsets that could include multiple close estimates for the same pulse and narrow it down so that there is only one per pulse and the best one is chosen (actually it looks at a few additional candidates, which tripped me up into thinking it was about interpolation for a long time). For this to happen, the analysis frame rate needs to be greater than the fundamental frequency, i.e. the hop size needs to be shorter than the f0 period (if they were equal, it would likely slowly drift and eventually miss an onset). I realized the reason the hop size was high (and thus the maximum frequency low) in the paper was that they were using low-frequency audio samples in the range of 50-100Hz, while I was using samples around 300Hz. I adjusted the hop size to 96 and got great results. I think I had also tried this before, but it had not worked, and it couldn't have, because this is only possible without decreasing the size of the analysis window within the overlapping window framework, which I had not implemented yet at the time I first tried.
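The arithmetic behind the hop size choice can be checked with a few lines. This is only a sketch of the constraint described above (frame rate strictly above the highest expected f0). The function names are made up, and it assumes f0_max is well below the sample rate.

```python
def frame_rate(hop_size, sample_rate):
    """Analysis frames per second for a given hop size in samples."""
    return sample_rate / hop_size

def max_hop_for_f0(f0_max, sample_rate):
    """Largest integer hop size whose frame rate is strictly above f0_max."""
    hop = int(sample_rate / f0_max)
    if sample_rate / hop <= f0_max:
        hop -= 1  # equal rates would let the onsets slowly drift out of sync
    return hop

print(frame_rate(256, 44100))       # old hop size: below a 300 Hz voice
print(frame_rate(96, 44100))        # new hop size: comfortably above 300 Hz
print(max_hop_for_f0(300, 44100))   # upper limit on the hop for a 300 Hz voice
```

A hop of 96 samples leaves a fair amount of headroom under that limit, at the cost of more frames to analyze.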
However, this low hop size is relatively computationally expensive, so much so that f0 and MFPA peaks take up most of the execution time. A possible improvement would be to use a lower analysis rate and actually use the interpolation method, but then feed the interpolated pulses into the MFPA refinement algorithm as you would likely get better results that way.
I have fixed numerous bugs within the MFPA refinement implementation. A noteworthy one is that previously, I was not considering that the analysis window's time is in the center, and not the start. Because of that, the new onsets are now offset compared to the old ones, but I believe it is now correct.
The pulse onset selection is now quite good: files.catbox.moe/ik27fw.png
Close up: files.catbox.moe/urby2w.png
However, there are still deviations. Here is one at around 20k samples in one of my test audio samples: files.catbox.moe/76nlgo.png
So there is still some work to be done.
Another potential improvement could be the introduction of a system for detecting formants and weighting them less in the MFPA calculation. Recall that phase is roughly constant within a formant, but not between them.