Project management of NTIS P1 Cybernetic Systems and Department of Cybernetics | WiKKY


RA1: Analysis of artifacts in synthetic speech

Status: open, over 6 years late (due 31.12.2017); 97% complete; 81 issues (75 closed, 6 open)

The main goal in this research area is to carry out a thorough analysis of the problems present in synthetic speech, to catalogue them (RA1a), and to describe them on both the "technical" and the phonetic level. By "technical" we mean mainly the dependence of the disruptive effects on the internal mechanism the synthesizer is based on: for instance, the correlation between the occurrence of disruptive effects (collected by expert phonetician listening and/or listening tests with naive listeners) and the values of the criterial (penalization) function, either the total cumulative function or partial functions based on particular components of the target and join cost functions (F0, spectrum, energy, and various positional parameters describing e.g. the position of a speech unit within a word/phrase/utterance), or HMM outputs. Statistical analysis and outlier detection techniques are planned to be used for this purpose (RA1b).
From the phonetic point of view, the description will focus on spectral discontinuities, which have been shown to influence listeners in English [MIA06]; it is reasonable to expect a similar effect in Czech. For instance, the typical nasal formants at mean levels of 250 and 1000 Hz and the nasal antiformants (suppressed frequency bands) are not constant across all nasalized vowels and their neighbourhood. Similarly, lip-rounding during articulation lowers the position of formants in sonorous segments. Also, the same vowel will have different spectral characteristics depending on the prominence of the syllable in which it occurs. Thus, if two segments or two parts of a segment meet that come from contexts with different degrees of nasalization, lip-rounding, or prominence, a discontinuity in the spectrum with perceptual consequences is likely to occur (RA1c). The knowledge acquired from these analyses will be used in the other RAs.
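As a minimal sketch of the kind of outlier detection planned for RA1b, the snippet below flags joins whose cost component deviates strongly from the rest of the utterance. The feature values and the threshold are invented for illustration; a real analysis would work with the project's actual target/join cost components.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag indices whose value deviates from the mean by more than
    `threshold` standard deviations (a simple univariate outlier test)."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical per-join F0 cost components collected during synthesis;
# the single extreme value mimics a join that listeners flagged as disruptive.
f0_join_costs = [0.11, 0.09, 0.13, 0.10, 0.12, 0.95, 0.08, 0.14, 0.10, 0.11]
suspects = zscore_outliers(f0_join_costs, threshold=2.0)
print(suspects)  # [5] — the index of the unusually expensive join
```

In practice one would run such a test per cost component (F0, spectral, energy, positional) and correlate the flagged joins with the listening-test annotations.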
Within this RA, and based on the previous findings, automatic cleaning of speech corpora (in the sense of fixing the annotation of the source speech recordings [MAT13]) will be performed by training a classifier/detector on positive (i.e., containing an identified misannotation) and/or negative (i.e., correctly annotated) examples, and subsequently applying the classifier/detector to all source recordings (RA1d).
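The classifier/detector idea can be sketched with a tiny hand-rolled logistic regression. The two features and all training examples below are hypothetical stand-ins for whatever boundary measures the actual annotation-cleaning experiments would extract from the source recordings.

```python
import math

def train_logistic(examples, labels, lr=0.5, epochs=2000):
    """Fit a small logistic-regression detector by batch gradient descent.
    `examples` are feature vectors; `labels` are 1 (misannotated) / 0 (correct)."""
    dim = len(examples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(examples, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            err = p - y
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(examples) for wi, g in zip(w, gw)]
        b -= lr * gb / len(examples)
    return w, b

def predict(w, b, x):
    """Probability that a segment is misannotated."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Hypothetical 2-D features per segment: (spectral distance at the boundary,
# deviation of segment duration from the phone's mean), both normalized.
X = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7), (0.1, 0.2), (0.2, 0.1), (0.15, 0.2)]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
print(predict(w, b, (0.85, 0.8)) > 0.5)   # likely flagged as misannotated
print(predict(w, b, (0.1, 0.15)) < 0.5)   # likely accepted as correct
```

The trained detector would then be run over all source recordings, and segments with a high misannotation probability sent back for re-annotation.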

Activity Objective Workplace 2016 2017 2018 Dissemination
RA1a Analysis and cataloguing of artifacts CU x Jimp: 1, Jneimp: 1, D: 2
RA1b Technical description of artifacts UWB x
RA1c Phonetic description of artifacts UWB x
RA1d Automatic cleaning of speech corpora UWB x x

RA2: Identification of relevant phonetic parameters crucial for the quality of synthetic speech

Status: open, over 5 years late (due 31.12.2018); 66% complete; 3 issues (2 closed, 1 open)

The phonetic reality of speech production is characterized by smooth, continuous movements of the articulators. Individual segments change into one another gradually, and even abrupt changes (such as the plosion of plosives) are framed by transitions in the spectral properties around them. Within RA2 we plan to ascertain the importance of individual contributors to fluent transitions (and of their violations) through reliable perceptual testing. Random samples of listeners will provide measures of the impact of transitions in the domain of the spectral correlates of nasality, labialization, and the "palatalizing" influence of [i:] and [j] (RA2a), of fundamental frequency (RA2b), and of the spectral manifestations of stress (RA2c). Synthesized chains of segments with specially designed parameters (varying in degree and type) will be used to test hypotheses about transitions. The experiments will naturally include natural speech samples to provide benchmark measures. A subset of the data will be tested using the word-monitoring paradigm [STU12]; in other words, reaction-time measurements will also be used to determine the disturbing effect associated with concatenation.
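A permutation test is one assumption-light way to analyze the reaction-time data that such a word-monitoring experiment yields. The reaction times below are fabricated solely to illustrate the computation; real analyses would of course use the measured data and likely per-subject designs.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=5000, seed=1):
    """One-sided two-sample permutation test on the difference of means:
    how often does a random relabelling of the pooled reaction times produce
    a difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = mean(b) - mean(a)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if mean(pooled[len(a):]) - mean(pooled[:len(a)]) >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical word-monitoring reaction times (ms): natural joins vs. joins
# with a deliberately introduced spectral discontinuity.
rt_natural = [412, 398, 430, 405, 420, 415, 401, 425]
rt_discont = [455, 470, 448, 462, 481, 459, 466, 473]
p = permutation_test(rt_natural, rt_discont)
print(p)  # small p-value: the discontinuity condition slows listeners down
```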

Activity Objective Workplace 2016 2017 2018 Dissemination
RA2a Perceptual testing of nasal., labial., and palatal. effects CU x Jimp: 1, Jneimp: 1, Jrec: 2, D: 1
RA2b Perceptual testing of F0 parameters CU x x
RA2c Perceptual testing of spectral parameters CU x x

RA3: Phonetically justified parameters for unit-selection and hybrid speech synthesis

Status: open, over 5 years late (due 31.12.2018); 91% complete; 46 issues (36 closed, 10 open)

The actual experiments in this RA will build on the findings from RA1 and RA2; nevertheless, our preliminary experiments already indicate that the context of the selected/modelled units is very important [TIH12]. We therefore assume that a properly designed "penalization matrix" (defining which contexts must be strictly respected during selection/modelling, such as the type of labialization, nasality, or palatalizing effect, and which contexts can be interchanged) is the key to smooth concatenation and proper modelling, since violating context continuity leads to disruptive effects in synthetic speech (RA3a). To ensure smooth and imperceptible concatenation in the spectral domain, we would also like to propose phonetically justified parameters (as opposed to the traditionally employed MFCCs) and use them both to describe and to control speech properties during unit modelling and/or selection (RA3b). Our preliminary experiments show that spectral tilt (expressed by phonetic indexes such as the Kitzing, Hammarberg, or Alpha index, by other filter-bank ratios, or measured with only the first few MFCC coefficients) could be an effective parameter. Other experiments will focus on the continuity of prominent prosodic parameters (especially F0 and durational patterns) within the synthesized utterance. In addition to a static measurement of the continuity of these parameters around the concatenation point (which by itself cannot prevent unnatural fluctuation of the patterns), we would also like to capture the tendencies of the prosodic patterns at the utterance level and, in this way, ensure the continuity of the monitored prosodic parameters (RA3c).
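One of the tilt measures mentioned above, the Alpha index, can be operationalized as a low-band/high-band spectral energy ratio. The band edges below (50 Hz to 1 kHz vs. 1 kHz to 5 kHz) are an assumption for illustration, not a project-fixed choice, and the naive DFT stands in for the FFT a real system would use.

```python
import cmath, math

def band_energy(signal, sr, f_lo, f_hi):
    """Sum of squared DFT magnitudes over bins in [f_lo, f_hi) Hz.
    Naive DFT: fine for short frames; use an FFT in practice."""
    n = len(signal)
    energy = 0.0
    for k in range(n // 2 + 1):
        f = k * sr / n
        if f_lo <= f < f_hi:
            coef = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                       for t in range(n))
            energy += abs(coef) ** 2
    return energy

def alpha_ratio(signal, sr, split=1000.0, top=5000.0):
    """Alpha-style tilt index: low-band energy (50 Hz..split) over
    high-band energy (split..top); higher values mean steeper tilt."""
    return band_energy(signal, sr, 50.0, split) / band_energy(signal, sr, split, top)

sr, n = 16000, 512
# Synthetic frame: strong 300 Hz component plus a weak 3 kHz component,
# so the low band dominates and the ratio comes out well above 1.
frame = [math.sin(2 * math.pi * 300 * t / sr)
         + 0.2 * math.sin(2 * math.pi * 3000 * t / sr) for t in range(n)]
print(alpha_ratio(frame, sr) > 1.0)
```

Comparing such a ratio across a candidate join (left frame vs. right frame) gives a cheap, phonetically interpretable continuity check.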
Many positional parameters (the position of a speech unit in a syllable, word, or other supra-word unit such as a phrase) are typically involved in the unit-selection process, which causes the selection to be "overfitted": in any real speech corpus, all positional parameters can hardly ever be satisfied at once, so the selection process has to "sacrifice" some parameters for others, and the result is not optimal. In our experiments we would like to revise these parameters (e.g. the position within a syllable seems to be less important in Czech than in some other languages) and, in order to ensure an optimal result, to find a way of weighting all involved parameters so that they correspond to the phonetic reality of speech perception (RA3d).
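The weighting idea can be illustrated with a toy target cost over positional features. The feature names and weight values are hypothetical, with the syllable position deliberately down-weighted in line with the remark about Czech above.

```python
def positional_cost(candidate, target, weights):
    """Weighted sum of positional mismatches between a candidate unit and
    the target specification; lower is better."""
    return sum(weights[f] * (candidate[f] != target[f]) for f in weights)

# Illustrative weights: syllable position matters least, word position most.
weights = {"pos_in_word": 1.0, "pos_in_phrase": 0.8, "pos_in_syllable": 0.2}
target = {"pos_in_word": "final", "pos_in_phrase": "medial", "pos_in_syllable": "coda"}
candidates = [
    {"pos_in_word": "final", "pos_in_phrase": "medial", "pos_in_syllable": "onset"},
    {"pos_in_word": "initial", "pos_in_phrase": "medial", "pos_in_syllable": "coda"},
]
best = min(candidates, key=lambda c: positional_cost(c, target, weights))
print(best["pos_in_word"])  # the candidate sacrificing only syllable position wins
```

Tuning such weights against listener judgements is exactly the kind of revision RA3d proposes.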

Activity Objective Workplace 2016 2017 2018 Dissemination
RA3a Context definition and penalization matrix CU x Jimp: 1, Jrec: 1, D: 4
RA3b Phonetically justified parameters (spectral tilt, ...) UWB x x
RA3c Continuity of prosodic patterns UWB x x
RA3d Revision of positional parameters and weighting UWB x

RA4: Automatic error prediction and dedicated signal modification

Status: open, over 5 years late (due 31.12.2018); 95% complete; 22 issues (17 closed, 5 open)

In concatenation-based speech synthesis there is always a danger that an artifact occurs at a concatenation point, even with the phonetically motivated optimizations proposed in RA3. This is caused by the limited size of the speech unit database relative to the natural variability of speech. Though it is widely accepted that the best quality in unit selection is achieved when no signal modification is carried out at all, we believe that selective signal modification, targeted at the specific component of unit selection that causes the artifact, can suppress it. Based on the analysis of artifacts in synthetic speech carried out in RA1, an error prediction module will be designed to predict potential artifacts (e.g. F0 discontinuities) in the to-be-synthesized speech at unit-selection runtime (RA4a) [LU10], [VIT13], [LEG13]. According to the type of the predicted artifact, a dedicated signal modification (e.g. F0 smoothing) will be carried out (RA4b). Since a combination of unit selection and HMM-based speech synthesis has been reported to be helpful in the literature (e.g. [BLA07], [SIL10]), hybrid approaches will be examined as well (RA4c). The possibility of generating speech from HMMs whenever the unit-selection scheme would result in an artifact will also be researched, and a compromise will be sought between using the selected (i.e. natural) speech segments (which can, however, result in discontinuities and disruptive artifacts) and generated segments (produced either by a dedicated signal modification technique or by HMM-based synthesis). The compromise should balance the mixing of selected and smoothed/generated speech, possibly with a scheme configurable according to listeners' preferences (RA4d).
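A dedicated F0-smoothing step of the kind mentioned for RA4b might, in its simplest form, pull the contour toward the midpoint value on both sides of a join. This is only a sketch under simplifying assumptions: it works on plain Hz values (real systems typically smooth in log-F0) and ignores voicing decisions.

```python
def smooth_f0_at_join(left_f0, right_f0, span=3):
    """Blend `span` frames on each side of a concatenation point toward the
    midpoint of the two boundary F0 values, shrinking the discontinuity."""
    mid = (left_f0[-1] + right_f0[0]) / 2.0
    left, right = list(left_f0), list(right_f0)
    for i in range(1, span + 1):               # i = distance from the join
        w = (span + 1 - i) / (span + 1)        # blend weight, largest at the join
        left[-i] = (1 - w) * left[-i] + w * mid
        right[i - 1] = (1 - w) * right[i - 1] + w * mid
    return left, right

# Hypothetical contour with a 40 Hz jump at the join (a likely audible artifact).
left, right = smooth_f0_at_join([120.0, 121.0, 122.0, 123.0],
                                [163.0, 162.0, 161.0, 160.0])
print(abs(left[-1] - right[0]))  # the discontinuity shrinks after smoothing
```

The error-prediction module (RA4a) would decide where such a modification is triggered, so that natural joins remain untouched.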

Activity Objective Workplace 2016 2017 2018 Dissemination
RA4a Automatic error prediction UWB x x Jimp: 1, D: 6
RA4b Dedicated signal modification UWB x x
RA4c Hybrid approaches UWB x x
RA4d Compromise between selected and generated speech UWB x

RAx: Administration

Project administration and related tasks

Status: open; 100% complete; 52 issues (52 closed, 0 open)