Project management of NTIS P1 Cybernetic Systems and Department of Cybernetics | WiKKY

Task #3970: Formants concatenation cost

This is part of RA3: Phonetically justified parameters for speech synthesis with parent task #3970.
The following description was taken from #3972.

We have hacked up the use of formants instead of MFCC in the concatenation cost (in addition to F0 and energy, which are computed as before). Each diphone candidate has a set of features (F0, energy, MFCC and now formant frequencies) assigned to both its beginning (b.) and its end (e.). When two candidates available for concatenation, left (.L) and right (.R), are evaluated, we compute:
  • the absolute value of the difference of z-score normalized energies, i.e. energy-cost = abs(enrg{eL} - enrg{bR})
  • the absolute value of the difference of z-score normalized F0 in the case of a voiced unit (either vowel or consonant); otherwise the cost is 0 (except in the case of voiced-unvoiced or unvoiced-voiced concatenation), i.e. F0-cost = abs(F0{eL} - F0{bR}) or 0.
  • the Euclidean distance of z-score normalized MFCC vectors (in the baseline system only!), i.e. MFCC-cost = eucl(MFCC{eL}, MFCC{bR})
  • the distances of z-score normalized formant contours (in the experimental system only!), see below.
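
The baseline sub-costs listed above can be sketched in Python roughly as follows. This is only an illustrative sketch, not the actual synthesizer code: the function names, the voicing flags and the mismatch_cost parameter are hypothetical, and the value used for a voiced-unvoiced (or unvoiced-voiced) join is not stated in this description, so the caller has to supply it.

```python
import math


def energy_cost(enrg_eL, enrg_bR):
    """Absolute difference of z-score normalized energies."""
    return abs(enrg_eL - enrg_bR)


def f0_cost(f0_eL, f0_bR, voiced_L, voiced_R, mismatch_cost):
    """F0 cost at the join.

    mismatch_cost is a hypothetical parameter: the description only says that
    the voiced-unvoiced / unvoiced-voiced case is an exception, not which
    value is used for it.
    """
    if voiced_L and voiced_R:
        return abs(f0_eL - f0_bR)
    if voiced_L != voiced_R:
        return mismatch_cost
    return 0.0  # both sides unvoiced


def mfcc_cost(mfcc_eL, mfcc_bR):
    """Euclidean distance of z-score normalized MFCC vectors (baseline only)."""
    return math.dist(mfcc_eL, mfcc_bR)
```

For example, mfcc_cost([0.0, 0.0], [3.0, 4.0]) gives 5.0, and f0_cost returns 0.0 whenever both sides of the join are unvoiced.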

Distances

For the explanation below, let F1{t}, F2{t}, F3{t} and F4{t} be the (z-score normalized) formant values at time t. When t = eL, it denotes the time nearest to the end of the left concatenated diphone, and t = bR denotes the time nearest to the beginning of the right concatenated diphone; i.e. we always examine the difference between the eL and bR features (both taken from a phone center). When the concatenated diphones were neighbors in the corpus, it is guaranteed that eL = bR.

Now it is possible to experiment with 3 computation schemes:
  • absolute difference of formants and their slopes (SLOPE):
     
    cost = (abs(F1{eL} - F1{bR}) * W1 + abs(F2{eL} - F2{bR}) * W2 + ... + abs(F4{eL} - F4{bR}) * W4 + abs(S1{eL} - S1{bR}) * W1 + abs(S2{eL} - S2{bR}) * W2 + ... + abs(S4{eL} - S4{bR}) * W4 + F0-cost + energy-cost) / (W1 + W2 + W3 + W4 + 1 + 1)
     
    where Sn is the slope of the n-th formant computed from the sequence of [ Fn{t-4}, Fn{t-3}, Fn{t-2}, Fn{t-1}, Fn{t}, Fn{t+1}, Fn{t+2}, Fn{t+3}, Fn{t+4}] formant values, t = eL or t = bR.
  • Euclidean distance of the formant contour (EUCL):
     
    cost = (eucl(C1{eL}, C1{bR}) * W1 + eucl(C2{eL}, C2{bR}) * W2 + ... + eucl(C4{eL}, C4{bR}) * W4 + F0-cost + energy-cost) / (W1 + W2 + W3 + W4 + 1 + 1)
     
    where Cn{t} = [ Fn{t-4}, Fn{t-3}, Fn{t-2}, Fn{t-1}, Fn{t}, Fn{t+1}, Fn{t+2}, Fn{t+3}, Fn{t+4}] is the sequence of formant values
  • Mean absolute difference of the formant contour (ABS), which is the same as the previous one, except that instead of the eucl(Cn{eL}, Cn{bR}) distance we use mean(abs(Cn{eL} - Cn{bR}))

For all the experiments, the weights were set to: W1 = 0.8, W2 = 1.0, W3 = 0.7, and W4 = 0.4. Also, formant bandwidths are not considered for now!
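
To make the three schemes concrete, here is a minimal Python sketch that follows the formulas above term by term. It is not the actual implementation. The function names and the data layout (one 9-value contour per formant, with Fn{t} in the middle) are assumptions made for illustration. The description also does not say how the slope Sn is computed from the 9 values; a least-squares line fit is assumed here.

```python
import math

WEIGHTS = (0.8, 1.0, 0.7, 0.4)  # W1..W4 as given in the text


def slope(contour):
    """Least-squares slope over equally spaced frames (assumed method)."""
    n = len(contour)
    xs = [i - (n - 1) / 2 for i in range(n)]  # centered frame indices
    mean_y = sum(contour) / n
    num = sum(x * (y - mean_y) for x, y in zip(xs, contour))
    return num / sum(x * x for x in xs)


def mean_abs(a, b):
    """Mean absolute difference of two equally long contours."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def formant_term(c_eL, c_bR, scheme):
    """Unweighted distance for one formant; c_* are 9-value contours Cn{t}."""
    if scheme == "SLOPE":
        mid = len(c_eL) // 2  # position of Fn{t} inside the contour
        return abs(c_eL[mid] - c_bR[mid]) + abs(slope(c_eL) - slope(c_bR))
    if scheme == "EUCL":
        return math.dist(c_eL, c_bR)
    if scheme == "ABS":
        return mean_abs(c_eL, c_bR)
    raise ValueError("unknown scheme: " + scheme)


def concat_cost(contours_eL, contours_bR, f0_cost, energy_cost, scheme):
    """Weighted cost; contours_* hold one contour per formant F1..F4."""
    weighted = sum(
        w * formant_term(cl, cr, scheme)
        for w, cl, cr in zip(WEIGHTS, contours_eL, contours_bR)
    )
    # The denominator follows the formulas in the text: W1+W2+W3+W4+1+1.
    return (weighted + f0_cost + energy_cost) / (sum(WEIGHTS) + 2)
```

In SLOPE the value term and the slope term of a formant share the same weight Wn, so they are summed before weighting; the denominator is the same for all three schemes, exactly as written in the formulas.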

Formants

The formant frequencies were estimated using PRAAT. The estimation was carried out by Zdeněk H. in #3999. In regions where formant frequencies were not detected, the formant values were set to a high number. This automatically led to a high cost when "correct" and "undefined" formant values were compared (i.e. the distance of (F.{eL}, F.{bR}) was a high number), thus effectively suppressing the concatenation of units with such a formant detection mismatch (except the case where the corresponding formant values were undefined for both the left (.L) and right (.R) parts, where the cost was 0).
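
The effect of the "high number" placeholder can be illustrated with a small Python sketch. The sentinel value 1.0e6, the use of None for undetected frames and the function name are all hypothetical; the description does not give the actual number, nor whether it is inserted before or after the z-score normalization.

```python
UNDEFINED = 1.0e6  # hypothetical sentinel; the actual value is not given


def fill_undefined(values):
    """Replace undetected formant values (None here) by the sentinel."""
    return [UNDEFINED if v is None else v for v in values]


left = fill_undefined([0.3, None])   # F.{eL}: one defined, one undefined
right = fill_undefined([None, None])  # F.{bR}: both undefined

# defined vs. undefined: huge difference, so the join is suppressed
mismatch = abs(left[0] - right[0])
# undefined vs. undefined: the difference is exactly 0
both_undefined = abs(left[1] - right[1])
```

Here mismatch is close to the sentinel itself, while both_undefined is 0, which matches the behavior described above.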

Examples of synthesized samples generated with different features (mfcc, formants) added.

The examples.zip file contains 2 directories:
  • formants_better - both formant versions sound better than the mfcc version
  • mfcc_better - the mfcc version of all sentences sounds better than the formant versions