/plaza/ - General/Random

The place to post!




Screen_Shot_2026-04-18_at_11.57.41_PM.png
[Hide] (229.2KB, 936x2115)
In the last post about my vocal synthesis project, I talked about implementing the Wide-Band Voice Pulse Modeling algorithm. Since then, I've actually done some original research of my own and have devised what I believe to be three minor improvements to the algorithm.

I implemented the Wide-Band Voice Pulse Modeling algorithm (from Dr. Jordi Bonada's PhD thesis: https://www.tdx.cat/bitstream/handle/10803/7555/tjbs.pdf) via the upsampling method (specifically, upsampling via a natural cubic spline). The thesis actually proposes two methods, the other being periodization. There is a patent that pertains to WBVPM, but it only covers the periodization version (which is what they used for their results), so I have implemented the upsampling method instead. I have been able to validate the main results of the thesis; specifically, its shape-invariance and lower residual when compared to other methods. Furthermore, I have devised three significant improvements to the algorithm - two of which are only possible because I used the spline approach, so in a sense it was good that I had to do it that way.

Of the three improvements, I have implemented the first two and shown their advantage over the original WBVPM algorithm. The resulting score is obtained by taking the mean of the relative residual level (i.e. the level of the difference between the original and reconstructed signals, relative to the level of the original signal). I have done so on an audio sample that deliberately exhibits traits that were noted as negatively affecting the quality of the WBVPM algorithm's output: notably, a low-pitched voice with rapid and deep vibrato, transients, strong amplitude modulation, and a large portion of the sample being taken up by a voiced/unvoiced/voiced transition.
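
For illustration, a score along these lines could be computed roughly as follows. This is only a sketch, not the code used for the results: the frame size, the use of RMS as the "level", the dB scale, the skipping of silent frames, and the function names are all assumptions, since the post does not spell them out.

```python
import math

def rms(frame):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def mean_relative_residual_db(original, reconstructed, frame_size=256, floor=1e-12):
    """Mean over frames of (residual RMS / original RMS) in dB; lower is better.

    frame_size and floor are illustrative choices, not values from the post.
    """
    n = min(len(original), len(reconstructed))
    levels = []
    for start in range(0, n - frame_size + 1, frame_size):
        orig = original[start:start + frame_size]
        recon = reconstructed[start:start + frame_size]
        residual = [o - r for o, r in zip(orig, recon)]
        orig_rms = rms(orig)
        if orig_rms < floor:
            continue  # assumption: silent frames are skipped rather than scored
        levels.append(20.0 * math.log10(max(rms(residual), floor) / orig_rms))
    if not levels:
        raise ValueError("no non-silent frames to score")
    return sum(levels) / len(levels)
```

With this definition, a reconstruction whose residual is one tenth of the original's level in every frame scores about -20 dB.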

First I should note that my WBVPM implementation is currently far from optimal. The pitch estimation system (via the modified TWM algorithm) has not undergone testing and tuning of its parameters, and there are many variations of the TWM algorithm to consider. Additionally, I have not implemented unvoiced/voiced detection (because, as far as I can tell, it is not mentioned in Bonada's thesis; presumably it's in prior literature, but I have not researched it yet), so all the algorithms act as if they are always processing a voiced signal even when they are not.

RESILIENT BORDER INTERPOLATION IN SYNTHESIS - When I first implemented the synthesis step for WBVPM, it was late at night and I was tired. I wanted a quick result before I went to bed and didn't understand the wording of the description of the synthesis step in WBVPM. As such, my original implementation differed significantly. Instead of using overlap-and-add, for each sample it found the closest voice pulse and determined that pulse's value at that time, taking advantage of the spline that was generated for downsampling and using the periodic nature of the pulse to extend it when the sample was beyond its domain (i.e. the opposite of overlapping). This approach led to high-frequency crackling artifacts due to discontinuities at the voice pulse boundaries.
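
A rough sketch of that nearest-pulse approach, under simplifying assumptions: pulses are plain lists holding exactly one period each, "closest" is measured to the pulse centre, and linear interpolation stands in for the natural cubic spline. All names here are hypothetical.

```python
def sample_periodic(pulse, position):
    """Value of a one-period pulse at a fractional position, wrapping periodically."""
    n = len(pulse)
    wrapped = position % n          # non-negative in Python for positive n
    i0 = int(wrapped)
    frac = wrapped - i0
    # Linear interpolation (stand-in for the spline); indices wrap around the period.
    return (1 - frac) * pulse[i0 % n] + frac * pulse[(i0 + 1) % n]

def nearest_pulse_synthesis(pulses, onsets, length):
    """For each output sample, read the value from whichever pulse centre is closest."""
    out = []
    for t in range(length):
        k = min(range(len(pulses)),
                key=lambda j: abs(t - (onsets[j] + len(pulses[j]) / 2)))
        out.append(sample_periodic(pulses[k], t - onsets[k]))
    return out
```

Nothing ties the value read from one pulse to the value read from the next at the point where the closest pulse changes, which is where the discontinuities come from.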

The following day, I properly understood the synthesis approach and rewrote the synthesis code. Interestingly, this actually gave worse overall results. While the high-frequency artifacts were gone, there were now large low-frequency artifacts that appeared as large modulations in the time domain. I eventually tracked this down to a bug in my implementation of the MFPA algorithm that sometimes resulted in massive errors of up to 1.5 radians. I fixed this bug and the reconstruction synthesis no longer had significant artifacts, but I thought it was interesting that my approach, despite having the discontinuity issue, was more resilient to errors in the MFPA estimation. I began wondering whether the two approaches could be combined to create an even better approach.

I was thinking about why the modulation occurred in the case of the overlap-and-add method. When the fundamental frequency is stationary and the MFPA onsets are perfect, the trapezoidal window function is equivalent to a weighted average between two adjacent voice pulses over a duration of twice the border interpolation size. However, when the MFPA onsets are inaccurate, or even just when the fundamental frequency is non-stationary, this is no longer true. Even worse, from the weighted average point of view, the sum of the weights isn't necessarily one everywhere anymore, hence the modulation.
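
To make the weight-sum argument concrete, here is a small sketch. It assumes a window that is zero one border size before the pulse onset, ramps to 1 over twice the border size, and mirrors this at the end; the border ratio of 0.1 and the function names are hypothetical.

```python
def trapezoid_weight(t, onset, period, border):
    """Trapezoidal window value at time t for a pulse that starts at `onset`."""
    x = t - onset
    if x <= -border or x >= period + border:
        return 0.0
    if x < border:                      # rising ramp, 2 * border long
        return (x + border) / (2.0 * border)
    if x > period - border:             # falling ramp, 2 * border long
        return (period + border - x) / (2.0 * border)
    return 1.0

def weight_sum(t, onsets, periods, border_ratio=0.1):
    """Sum of all pulse windows at time t; 1.0 means no amplitude modulation."""
    return sum(trapezoid_weight(t, o, p, border_ratio * p)
               for o, p in zip(onsets, periods))
```

With periods of 100 samples and onsets at 0, 100, 200, the sum at t = 95 is 0.75 + 0.25 = 1. If the middle onset is misplaced to 105, the sum at t = 100 drops to 0.5 + 0.25 = 0.75, which shows up as a dip in amplitude.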

I then devised a method that would not result in modulation. This method works by first synthesizing the 'inner' portion of each pulse (by 'inner', I mean starting at the end of the border interpolation at the start of the pulse, and ending at the start of the border interpolation towards the end of the pulse). Then, for the gaps in between the pulses, we calculate each sample value as a weighted average of two values: the values of the two adjacent voice pulses at that time. Since the gap extends beyond the boundaries of each voice pulse, we use the periodic nature of the pulses to compute the effective position in the voice pulse by taking the position modulo the period of the fundamental frequency at that voice pulse. The fundamental frequencies of the two voice pulses may differ, so we actually change the step in time linearly. At the end of the gap, the step size for the voice pulse it is next to (the latter) is one sample in time, while the step for the former voice pulse is the equivalent of one sample in the latter voice pulse relative to the former's fundamental frequency (e.g. if the second voice pulse has twice the fundamental frequency of the first, then at the end of the gap the step size for the first would be 2 and the step size for the second would be 1). At the start of the gap, it is the same except relative to the first pulse having a step of 1. In between, we interpolate the step size linearly.
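
One possible reading of the gap-filling step, sketched under several assumptions that the text above does not confirm: pulses are lists holding exactly one period (so the period is the list length), the crossfade weight is linear across the gap, positions in the first pulse are accumulated forwards and positions in the second pulse backwards from its inner portion, and linear interpolation stands in for the spline. All names are hypothetical.

```python
def sample_periodic(pulse, position):
    """Value of a one-period pulse at a fractional position, wrapping periodically."""
    n = len(pulse)
    wrapped = position % n
    i0 = int(wrapped)
    frac = wrapped - i0
    return (1 - frac) * pulse[i0 % n] + frac * pulse[(i0 + 1) % n]

def gap_steps(period_a, period_b, gap_len):
    """Per-sample step sizes for both pulses, interpolated linearly across the gap."""
    steps_a, steps_b = [], []
    for i in range(gap_len):
        u = i / (gap_len - 1) if gap_len > 1 else 0.0
        steps_a.append((1 - u) * 1.0 + u * (period_a / period_b))
        steps_b.append((1 - u) * (period_b / period_a) + u * 1.0)
    return steps_a, steps_b

def fill_gap(pulse_a, pulse_b, end_a, start_b, gap_len):
    """end_a: position in pulse_a of its last inner sample.
    start_b: position in pulse_b of its first inner sample."""
    steps_a, steps_b = gap_steps(len(pulse_a), len(pulse_b), gap_len)
    pos_a, p = [], end_a
    for s in steps_a:                    # walk forwards out of pulse A
        p += s
        pos_a.append(p)
    pos_b, p = [0.0] * gap_len, start_b
    for i in reversed(range(gap_len)):   # walk backwards out of pulse B
        p -= steps_b[i]
        pos_b[i] = p
    out = []
    for i in range(gap_len):
        w = (i + 1) / (gap_len + 1)      # assumed linear crossfade weight
        out.append((1 - w) * sample_periodic(pulse_a, pos_a[i])
                   + w * sample_periodic(pulse_b, pos_b[i]))
    return out
```

Because the two weights always sum to one, the gap cannot introduce the amplitude modulation described earlier, whatever the onset error. For a second pulse with twice the fundamental frequency, `gap_steps(8, 4, 3)` gives steps of 1, 1.5, 2 for the first pulse and 0.5, 0.75, 1 for the second, matching the example in the text.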
Message too long. View the full text

sunny.jpg
[Hide] (278.7KB, 854x1280)
A little delayed, but submit your nominees for best boy of 2025. Ahead of the upcoming English translation of the manga and the 5th anniversary, I nominate Sunny.
2 replies and 2 files omitted. View the full thread
Tumblr_l_22713210101583.jpg
[Hide] (76.6KB, 580x280)
https://strawpoll.com/LVyK20GGdZ0

Well, Marzinons?
I am banned from voting as per marzicourt rules
jsid
>>9468
the vote-per-ip stats show that this is clearly a rigged poll
>>9476
Basilbro cope

afd8dfafe1c621acd0e47116e6ab5bd0.jpg
[Hide] (598.1KB, 2894x4093)
post characters that you think are marzichan-coded. patchouli is an obvious one
18 replies and 12 files omitted. View the full thread
banana.jpg
[Hide] (2.4MB, 3277x1872)
this lil guy
>>9109
green banana with the seed
x4a6F8Rp1w7l907o1_500.jpg
[Hide] (33.9KB, 500x410)
51ofQYfVj+L._AC_UF1000,1000_QL80__edit_498217852550539.jpg
[Hide] (45.5KB, 377x323)
>>9474
you know marzi doesn't even accept donations to run this beautiful website I find it really hyenous of you to come on here just to call him a crocodile like that you should be ashamed of yourself

1769430196242w.jpeg
[Hide] (17.4KB, 1085x355)
dead website
just shut it down at this point
40 replies and 7 files omitted. View the full thread
>>9419
go back
1775939957241-tegaki.png
[Hide] (10.8KB, 500x500) [Replay]
>>9420
>>9421
geg
>>9422
bib
Shartoid_expulsion.png
[Hide] (704.3KB, 815x824)
>>9419
You are not welcome here, just like you are not welcome anywhere else except for your retarded pedophile shithole and its associated splinters and Discord servers.  Go back there.

P.S.: Even Quote himself knows how utterly worthless you are, hence why he intends to shut down the Shitty this year and refuses to even sell it to anyone else.
>>9418
sybau unc frfr no cap cuh ong :wilted_rose:

1776003490767264.jpg
[Hide] (90.7KB, 800x450)
i wish i went to college

1775878583639047.jpg
[Hide] (61.6KB, 660x378)
"You wanna know how many fucking baked zitis we go through in this house?"

maxresdefault_(83).jpg
[Hide] (46.4KB, 1280x720)
Are these the world's most crispy fries? Let's find out.
crispy fries are the best
I love when places have the fries with the starch coating for extra crispness
They are.
I was going to make a comment on how a year's worth of studies funded by FDA grants and several Ivy League schools blah blah blah, but then I realized this thread was posted just three months ago. It's been a long three months.
>>7680
tell me about it
IMG_20260413_142656777_HDR.jpg
[Hide] (487.4KB, 1850x1968)
>>5954 (OP) 
https://youtu.be/Tv6WImqSuxA

Take_It_Easy!.webp
[Hide] (46.5KB, 600x334)
im hungry
3 replies and 1 file omitted. View the full thread
Replies: >>9391
>>9340 (OP) 
same... idk what to have for breakfast,,, and cooking is so much effort.. waahhhhhh,
>>9391
try greek yogurt with berries and granola
>>9392
So trve
>>9392
didnt have any of those things.
i had reheated pigs in blankets (as in wrapped in pastry not bacon) instead.
>>9397
homemade?

Ebil_at_computer.jpg
[Hide] (63KB, 772x594)
ITT: We be rude and mean.
18 replies and 10 files omitted. View the full thread
Replies: >>9382 + 3 earlier
GettyImages-140626219-1ca0aee.jpg
[Hide] (790KB, 2904x2085)
>>8347
Oh, I see how it is. Good luck on your escape miss! ...Oh, just one more thing, if you don't mind me asking... what exactly were you doing at the idiot store in the first place?
You have tiny balls.
>>8353
that's just compact design tho?
>>8349
cracked me up lmao
>>7647 (OP) 
Op yo mom is so hairy that every time she go jogging down the street, her legs look like two dogs are fighting

Screen_Shot_2026-03-30_at_4.51.32_AM.png
[Hide] (91.3KB, 1132x839)
Hello, I'm back and I have a major update to my VOCALOID project! I have successfully achieved a shape-invariant pitch transposition!

Here it is.
First the original audio: files.catbox.moe/zmt3rr.wav
Now my version with WBVPM (pitched down by an octave): voca.ro/1mJ5qljrp9hD or files.catbox.moe/kho97n.wav
And a version using a naive pitch shift: files.catbox.moe/xs39bq.wav

Notice that my version, while having more noise, sounds more natural and has less phasiness. This is particularly noticeable if you play both at very low volume. One sounds much more 'human' than the other.

Also note that this is an extreme example with an octave shift (or 1200 cents) - in practice, shifts would typically be far smaller. Also, this doesn't implement several other parts of the system (more on that later).

I'll explain all of this in a moment, but first, I'd like to correct some major biographical errors. Since this is a long post, I've divided it into sections.

BIOGRAPHICAL CORRECTIONS

Message too long. View the full text
3 replies omitted. View the full thread
Replies: >>9349
>>9255
Thank you!
Hello, I'm back with another update to my VOCALOID project. It's not as big an improvement as last time - and in fact, there are no new features - but I felt like it was worth posting. I've been trying to rectify the major issues before I move on to implementing the Excitation plus Resonance model.

The first thing I attempted to tackle was all the added noise at high frequencies.
Here's the original spectrum: https://files.catbox.moe/fq55bo.png
And here's the reconstructed spectrum (with no transforms applied): https://files.catbox.moe/gq7jff.png
You can clearly see the high-frequency artifacts. The first thing I tried was something mentioned in the paper. In the WBVPM section, it is mentioned that there are two approaches for a non-integer-size discrete Fourier transform. The first is repeating the signal, while the second is upsampling it. I went with the second, as the former is patented and the second is also easier to implement. It is mentioned that increasing the repetition count of the signal (or, in the case of upsampling, the upsampling factor) and then discarding the higher frequencies can improve the estimation by reducing artifacts. In the case of repetition, it is also mentioned that quadratic interpolation can be used on the resulting spectrum; however, I am not sure if this can be done for upsampling, and as such I have not tried to implement it for now.
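
As a rough illustration of the upsampling route, the sketch below resamples one non-integer-length period onto an integer grid that is `factor` times denser, takes a plain DFT so that bin h is harmonic h, and keeps only the harmonics below the original Nyquist frequency. Linear interpolation stands in for the natural cubic spline, and the function names and grid-size rule are assumptions for illustration, not details from the thesis.

```python
import cmath
import math

def lerp_sample(signal, t):
    """Linearly interpolated value of `signal` at fractional time t."""
    i = int(math.floor(t))
    frac = t - i
    return (1.0 - frac) * signal[i] + frac * signal[i + 1]

def pulse_spectrum(signal, onset, period, factor=3):
    """Harmonic spectrum of one pulse of non-integer `period` starting at `onset`."""
    n = math.ceil(period * factor)                 # integer-size, denser grid
    resampled = [lerp_sample(signal, onset + k * period / n) for k in range(n)]
    kept = int(period // 2) + 1                    # harmonics up to the original Nyquist
    spectrum = []
    for h in range(kept):
        acc = sum(x * cmath.exp(-2j * math.pi * h * k / n)
                  for k, x in enumerate(resampled))
        spectrum.append(acc / n)                   # bins above `kept` are discarded
    return spectrum
```

For a period of 10.5 samples and a factor of 3, the grid has 32 points and 6 bins are kept (harmonics 0 to 5).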

Here's the result after applying an upsampling factor of 3: https://files.catbox.moe/qcgnzq.png
Here's the original audio: https://files.catbox.moe/f7g8ta.wav
The original reconstruction: https://files.catbox.moe/da0m1i.wav
And now with the improved reconstruction: https://files.catbox.moe/513ycn.wav
You can see an improvement, especially at lower frequencies; however, the high-frequency artifacts largely persist, so they have to be arising elsewhere. I realized the source was the reconstruction of the signal (AKA the "synthesis"). I had previously implemented a synthesis method that was quite different from the one used in the study, because I did not understand the method in the study at first. My synthesis method worked by taking each voice pulse and, for each sample where that voice pulse is the closest voice pulse to the sample, setting the value of the sample to the interpolated value of a spline representing a time-domain version of the upsampled voice pulse, with a step corresponding to the ratio between a sample in the regular time domain and a sample in the upsampled time domain. Now, in some cases, estimation inaccuracies and differences from any transformations that were applied result in these regions of samples being bigger than the actual voice pulse itself. In these cases, we take advantage of the periodic nature of the voice pulse and repeat it (i.e. sampling before the start is equivalent to sampling at that offset from the end, and sampling after the end is equivalent to sampling at that offset from the start). However, this method results in discontinuities in some cases.
Here is an example of such a discontinuity: https://files.catbox.moe/jnnxfj.png
I began to try to implement an interpolation system. In this system, we would calculate the gap between pulses - or, in the case of inaccuracies in the other direction (i.e. overlapping pulses), the overlapping area - and interpolate linearly between one pulse and the other. However, this approach was complicated significantly by the non-integer (and potentially differing) sizes of the pulses as well as numerous edge cases. For this reason, I struggled and spent over an hour trying to figure out how to do it correctly. About halfway through, I decided to check the paper again, and this time I understood the actual synthesis method properly, largely because of a diagram I had missed the first time.
In the actual method, each pulse is expanded in a manner similar to that of the border interpolation technique used in WBVPM analysis, except kind of in reverse. In this technique, for each voice pulse, we generate extensions on both sides, with each extension having a size equal to the border interpolation ratio times the size of the voice pulse. Then we apply a trapezoidal window to the extended voice pulse, which starts at zero at each end of the extended voice pulse and reaches 1 after a ramp of twice the border interpolation size on each side. Then we overlap and add the voice pulses.
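
The extend, window, and overlap-add steps can be sketched as below. This is a simplified illustration, not the thesis's implementation: it assumes integer-length pulses, an integer border size between 1 and half the pulse length, and periodic repetition of the pulse as the way to generate the extensions. The function names are hypothetical.

```python
def trapezoid(x, n, border):
    """Window over the extended pulse: 0 at x = -border, 1 from x = border
    to x = n - border, and back to 0 at x = n + border."""
    if x < border:
        return (x + border) / (2.0 * border)
    if x > n - border:
        return (n + border - x) / (2.0 * border)
    return 1.0

def overlap_add(pulses, onsets, border, length):
    """Extend each pulse by `border` samples per side, window it, and sum."""
    out = [0.0] * length
    for pulse, onset in zip(pulses, onsets):
        n = len(pulse)
        for x in range(-border, n + border + 1):
            t = onset + x
            if 0 <= t < length:
                # pulse[x % n] repeats the pulse periodically into the extensions
                out[t] += trapezoid(x, n, border) * pulse[x % n]
    return out
```

When identical pulses are placed exactly one period apart, the falling ramp of one pulse and the rising ramp of the next sum to one, so the interior of the output reproduces the periodic signal exactly.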
Message too long. View the full text
woag.gif
[Hide] (449.1KB, 220x220)
>>9252 (OP) 
I'd be lying if I said I understood anything but I am most impressed and ergo proxy super proud 🥺
>>9349
thank you!!
>>9350
You're welcome.
