Project management of NTIS P1 Cybernetic Systems and Department of Cybernetics | WiKKY

Periodic reports

2018

Summary of project goals and how they were achieved

The main objective of the project was to remove disruptive effects from synthetic Czech speech by enhancing corpus-based algorithms while exploiting a thorough understanding of human speech production and perception.
Main research activities and findings:
  • The necessity of considering major allophonic variants of Czech phonemes (esp. with respect to voicing, the alveolar vs. velar nasal, and consonant syllabicity) was demonstrated.
  • The quality of speech is affected by rhythmical patterning, and its simplified modeling resulted in frequent audible artifacts. We found that phrase-internal, non-nuclear vowels in word-final syllables are significantly longer in Czech than in non-final contexts. Respecting this word-final lengthening led to a higher perceptual quality of speech. A hybrid approach, which uses continuous temporal patterns, also improved the resulting speech.
  • Procedures were designed to automatically clean speech corpora at the textual and prosodic levels, enhancing both the segmental and prosodic quality of synthetic speech.
  • We proposed phonetically motivated representations (formants, spectral slope, segmental context) and examined whether they contribute to reducing spectral discontinuities. In addition, we showed that articulatory data provide new information and have the potential to improve synthetic speech. These representations are beneficial in that they are more transparent and also seem to be computationally more efficient.
  • Another major source of audible artifacts is imperfect modeling of fundamental frequency (F0). A data-driven algorithm for accurate F0 contour detection was proposed, applicable to other languages as well. Based on phonetic findings, we incorporated into unit selection the tendency toward a specifically Czech post-stress F0 rise.
  • Basic algorithms for automatic (objective) error detection and synthetic speech evaluation have been proposed.
  • Modification and smoothing of the speech signal were performed using a statistical parametric speech synthesis (SPS) approach. We also merged the concatenative and SPS approaches into a hybrid framework allowing both to be used in a defined ratio, compromising between concatenated and generated speech. A new generation of speech synthesis based on a generative approach that utilizes the statistical power of deep neural networks to directly generate individual speech samples has been researched as well.
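The ratio-based compromise between concatenated and generated speech can be illustrated with a minimal sketch. The linear blending of per-frame parameter tracks and the example F0 values below are illustrative assumptions, not the project's actual implementation:

```python
# Illustrative sketch only: we assume the "defined ratio" is applied as a
# linear interpolation between per-frame parameter tracks taken from
# concatenated units and from an SPS model.

def hybrid_blend(concat_track, sps_track, ratio):
    """Blend two per-frame parameter tracks.

    ratio = 1.0 -> purely concatenative values,
    ratio = 0.0 -> purely SPS-generated values.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return [ratio * c + (1.0 - ratio) * s
            for c, s in zip(concat_track, sps_track)]

# Hypothetical F0 tracks (Hz) for a short segment:
concat_f0 = [120.0, 124.0, 131.0, 128.0]
sps_f0 = [118.0, 122.0, 126.0, 125.0]
blended = hybrid_blend(concat_f0, sps_f0, 0.5)
```

With ratio 0.5 the blended track lies halfway between the two inputs; in practice the ratio would trade off the naturalness of concatenated segments against the smoothness of generated ones.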

Research results were continuously integrated into synthesis algorithms and their influence on the resulting speech quality was verified by listening tests or automatically.

The achievement of the project goals is documented by the attained results - a total of 28 research publications were published (5 articles of type Jimp, 1 of type Jost, and 22 of type D - marked according to the current Results assessment methodology). Due to a change in the methodology during the course of the project, publication of D results in the proceedings of prestigious conferences was strengthened at the expense of Jost outputs.

Benefit of the 3 most important publications

Matoušek, J., Tihelka, D. Anomaly-based annotation error detection in speech-synthesis corpora. Computer Speech and Language. 2017, vol. 46, pp. 1-35

The main benefit is the design of an algorithm for automatic detection of annotation errors in corpora used for speech synthesis. The algorithm is based on an anomaly detection approach in which misannotated words are taken as anomalous examples which do not conform to normal patterns of a model trained on correctly annotated words. A final listening test showed the effectiveness of the proposed anomaly-based annotation error detection for improving the quality of synthetic speech.
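As a rough illustration of the anomaly-detection idea (not the authors' actual model), a detector can be trained only on features of correctly annotated words and then flag words that deviate too far from the learned "normal" distribution. The single scalar feature and all values below are invented for illustration:

```python
import math

# Toy sketch: a Gaussian model of "normal" (correctly annotated) words;
# words whose feature falls far outside the model are flagged as anomalous.

class GaussianAnomalyDetector:
    def fit(self, normal_examples):
        n = len(normal_examples)
        self.mean = sum(normal_examples) / n
        var = sum((x - self.mean) ** 2 for x in normal_examples) / n
        self.std = math.sqrt(var) or 1.0  # guard against zero variance
        return self

    def is_anomalous(self, x, threshold=3.0):
        # Flag values more than `threshold` standard deviations from the mean.
        return abs(x - self.mean) / self.std > threshold

# E.g. a hypothetical per-word acoustic match score from forced alignment:
normal_scores = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.90, 0.91]
detector = GaussianAnomalyDetector().fit(normal_scores)
print(detector.is_anomalous(0.45))  # a misannotated word scores far lower
```

The key property, as in the paper, is that no misannotated examples are needed for training: only "normal" words define the model.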

Jůzová, M., Tihelka, D., Skarnitzl, R. Last Syllable Unit Penalization in Unit Selection TTS. Lecture Notes in Artificial Intelligence: Text, Speech, and Dialogue (TSD). 2017, vol. 10415, pp. 317-325

The main benefit is the finding that phrase-internal, non-nuclear vowels in word-final syllables are significantly longer in Czech than in non-final contexts. It was generally considered that this phenomenon only applies to phrase-final syllables. Respecting this word-final lengthening also in phrase-internal contexts led to considerably higher perceptual quality of synthetic speech, especially with respect to its rhythmical pattern.
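The penalization idea behind the paper can be sketched as an extra target-cost term in unit selection: a candidate vowel drawn from a word-final syllable should be discouraged in a non-final target position and vice versa. The penalty value and function names below are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch: penalize syllable-position mismatch between a target slot
# and a candidate unit, reflecting Czech word-final vowel lengthening.

POSITION_MISMATCH_PENALTY = 10.0  # illustrative weight

def position_cost(target_word_final: bool, candidate_word_final: bool) -> float:
    """Extra cost when the candidate's syllable position does not match."""
    if target_word_final == candidate_word_final:
        return 0.0
    return POSITION_MISMATCH_PENALTY

def target_cost(base_cost: float, target_word_final: bool,
                candidate_word_final: bool) -> float:
    """Combine a pre-existing target cost with the position penalty."""
    return base_cost + position_cost(target_word_final, candidate_word_final)
```

Under this scheme a slightly worse-matching candidate from the correct position can win over a better-matching one from the wrong position, which is the intended effect on rhythm.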

Matoušek, J., Tihelka, D. Classification-Based Detection of Glottal Closure Instants from Speech Signals. Proc. Interspeech 2017. Stockholm, Sweden, 2017, pp. 3053-3057

The main benefit is the design of a classification-based algorithm for the precise detection of glottal closure instants (GCIs) directly from the speech signal. The advantage of the classification-based method is that once a training dataset is available, classifier parameters are set up automatically without manual tuning. The proposed GCI detection consistently outperformed traditionally used algorithms on several test datasets. Besides Czech, the algorithm was successfully applied to other languages (English, French, German, Slovak).
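A simplified illustration of the candidate-generation step usual in classification-based GCI detection: local negative peaks of the waveform are taken as GCI candidates, which a trained classifier then accepts or rejects. The naive amplitude threshold below merely stands in for that classifier, and the waveform fragment is invented:

```python
# Sketch only: negative local minima deeper than `min_depth` become GCI
# candidates; a real system would score each candidate with a trained model.

def gci_candidates(signal, min_depth=0.3):
    """Return indices of negative local minima deeper than `min_depth`."""
    cands = []
    for i in range(1, len(signal) - 1):
        if (signal[i] < signal[i - 1] and signal[i] < signal[i + 1]
                and signal[i] < -min_depth):
            cands.append(i)
    return cands

# Hypothetical waveform fragment with two pronounced negative peaks:
frag = [0.1, -0.05, -0.8, -0.1, 0.2, 0.05, -0.6, -0.02, 0.1]
print(gci_candidates(frag))  # -> [2, 6]
```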

2017

Research activities are divided into four basic research areas (RAs). In 2017, in addition to the planned activities RA1d, RA2abc, RA3abc, and RA4abc, research also started in RA3d. The publication plan for 2017 was fulfilled and even exceeded – 1-2 results of type Jimp and 1 result of type Jneimp or Jrec were planned, and 3 publications of type Jimp were actually published. In addition, 6 publications of type D were published (4 of them were planned). Project progress can be tracked on the project website (https://wikky.zcu.cz/redmine/projects/hqsyn16).

RA1

We have proposed a 2-stage iterative procedure for correcting the prosodic annotation of speech corpora (RA1d), based on HMMs with a large context description: (i) correction of phrase boundaries by the detection/correction of pauses (including also a correction of neighboring phones, e.g. changes of voicing, insertion/deletion of glottal stops, etc.); (ii) correction of the phrase (prosodeme) type. As for word-level annotation correction (RA1d), we finished the concept of “anomaly detection” and compared it to a standard classification approach in terms of annotation error detection accuracy and required data size.

RA2

In RA2, the work focused on the perceptual testing of intrusive phenomena, mostly in the spectral domain and in the domain of fundamental frequency (F0). Preliminary analyses revealed that a great majority of the problems resulting from contextual incongruity (i.e., the effect of labial, palatal, or nasal contexts; see RA2a) can be captured by formant values (RA2c). This led to the creation of a perception test in which formant frequencies were manipulated in a controlled way so as to create artificial discontinuities. Subsequent perceptual testing showed that formant discontinuities may have serious perceptual repercussions, affecting the perceived length of the vowels in question; since Czech is a language with distinctive vowel length, this was a very important finding, published in the Interspeech 2017 proceedings. F0 shifts (RA2b) yielded no such effects, but work is currently continuing on preparing more sophisticated F0 discontinuities (see RA3c).

RA3

The concept of a context penalization matrix (RA3a) was elaborated through the perceptual assessment of incongruent allophonic variants; this led to formulating rules concerning which Czech systematic variants can be pooled for synthesis (submitted as a journal paper in May 2017). Work is continuing on the coarticulatory context penalization matrix, specifically with regard to weighting possibilities.

Research regarding phonetically justified parameters and their usage in speech synthesis also concerned the employment of formant frequencies (RA3b). We found that the "raw" formants might help to reduce the occurrence of unnatural glitches, but they were not enough to eliminate all of them and might introduce some new ones. Thus, we are going to incorporate spectral slope in the ongoing experiments. The research on articulatory synthesis (RA3b) consisted of experiments with using electromagnetic articulograph-based data as new features for unit selection synthesis, and of testing the hypothesis that articulatory data bring new information to speech processing/modeling and thus make sense to use as a supplement to traditional spectral features.

As for the continuity of prosodic patterns (RA3c), we proposed a novel classification-based method of glottal closure instant detection from the speech signal that can serve as a precise estimate of the F0 pattern. We showed that the proposed method outperformed other state-of-the-art methods. Besides Czech, the method was also tested on Slovak, English, German, and French speech signals. Work on determining intrusive F0 jumps (RA3c) started with PSOLA manipulations of F0 in various types of segments. Preliminary auditory analyses showed little effect of plain +/- 2 ST and even +/- 4 ST discontinuities, and work is currently continuing on devising more sophisticated F0 discontinuities which also incorporate the direction of change (delta values); perceptual testing will begin in January 2018.
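On the frequency axis, a shift of `st` semitones (as in the +/- 2 ST and +/- 4 ST manipulations above) corresponds to multiplying F0 by 2**(st/12). A minimal sketch of applying such a shift to an F0 contour (the contour values are invented; the real manipulations used PSOLA on the signal itself):

```python
# Semitone-to-ratio conversion and a per-frame F0 shift; 0.0 marks
# unvoiced frames, which are left untouched.

def shift_semitones(f0_contour, st):
    """Shift every voiced F0 value (Hz) by `st` semitones."""
    factor = 2.0 ** (st / 12.0)
    return [f * factor if f > 0.0 else 0.0 for f in f0_contour]

contour = [110.0, 112.0, 0.0, 115.0]   # Hz; 0.0 = unvoiced frame
raised = shift_semitones(contour, 2)   # a +2 ST manipulation
```

Shifting only part of a contour in this way creates exactly the kind of artificial F0 discontinuity used as a perceptual stimulus.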
We also focused on F0 continuity in the first two syllables of phrase-internal prosodic words and confronted the phonetic knowledge with the real intonation contours measured in the source speech data, with the aim of utilizing this knowledge in speech synthesis. We further revised discrete syllable-based positional features to cope with the “last syllable lengthening” phenomenon (RA3d) and achieved encouraging results with the elimination of inappropriate syllable position mixing. We also tried to tweak the continuous positional parameter computation to mimic the results obtained. In addition, we confronted the consistency of phonetically motivated phrase-internal prosodic word positioning with the real intonation contours present in our speech corpora. Work is in progress on a data-driven formalization of the very specific melodic contours of Czech wh-questions, as well as of phonetic word parsing, whose more complex details (like joining two two-syllabic words into one stress group) have never been addressed. A paper addressing the properties of Czech stress and, most importantly for this project, duration discontinuities across stress group boundaries has been submitted to an impact journal.

RA4

We continued our research on the automatic (objective) evaluation of synthetic speech (RA4a). We showed that for the two voices under consideration, ANOVA-based artifact detection and GMM-based artifact localization and classification yield results comparable to those obtained by subjective listening tests when special care is given to feature preparation and/or to the determination of the region of interest. As similar results were obtained for age and gender classification in Czech and Slovak, the proposed approach should work for other voices too.

For signal modification and smoothing (RA4b), we focused on a statistical parametric speech synthesis (SPS) approach using HMM- and DNN-based statistical models to estimate speech parameters from a given linguistic input. In HMM-based synthesis, we performed experiments with a variable number of states, optimal for particular models. We also compared HMM- and DNN-based models, and several speech analysis-synthesis methods (vocoders) used to generate speech from the estimated speech parameters. We also employed a novel approach based on a neural network with a special architecture known as “WaveNet”, which can directly generate speech samples given an input linguistic and target prosodic specification.

A hybrid speech synthesis framework (RA4c), a combination of the unit-selection and SPS methods, was also elaborated.

Plan for 2018

In 2018, work will continue on topics started in 2017. Specifically, perceptual testing of more sophisticated F0 manipulations, addressing the direction of change and not only the interval, will be conducted at the beginning of the year; this will allow us to formulate rules for stronger and weaker penalization of F0 discontinuities (RA2b, RA3c), and we plan to submit a journal paper by the middle of the year. Work will finish on the penalization matrix for different coarticulatory contexts (RA3a). Other articulatory-based and phonetically justified parameters, like EMA-based measures and spectral slopes, will be tested as a supplement to formant frequencies and/or a replacement of the traditionally used spectral parameters (MFCCs) to obtain smoother spectral patterns (RA3b). Analyses will continue of the melodic contours of wh-questions and their reflection in synthetic speech (RA3d); two journal papers will be submitted. Also as part of RA3d, stress group parsing will be compared in natural and synthetic speech; ideally, this will yield formalized recommendations for contexts in which two poly-syllabic words may occur as part of a single stress group, resulting in improved naturalness of synthetic speech. Based on these findings, the positioning and weighting scheme will be revised and tested in speech synthesis. The results will be submitted as a journal paper. The researched approaches, statistical parametric and unit selection based, will be merged into a hybrid speech synthesis framework (RA4c) utilizing the advantages of both. In addition, work on the progressive state-of-the-art WaveNet-based speech synthesis method will continue, to enable a compromise between natural speech samples (which, when concatenated, can result in discontinuities and disruptive artifacts) and generated speech samples (RA4d).

The plan of submissions for 2018 reflects the originally proposed plan and the already published (or submitted) works. In 2017, one article (Jrec) was submitted to “Akusticke listy” and has been under review for 7 months already! Another article (Jimp, “Slovo a slovesnost”) was submitted at the end of August. We hope both articles will be published in 2018. The revised plan for 2018 thus focuses on journal papers:

1-2 results of type Jimp, 4-5 results of type Jneimp or Jrec, and at least 2 results of type D

2016

Research activities are divided into four basic research areas (RAs). In 2016, in addition to the planned activities RA1x and RA4a, research also started in RA3b. The publication plan for 2016 was fulfilled and even exceeded: while 3 results of type D were planned, 8 publications of type D were actually published.

RA1

RA1a

The analysis and cataloguing of disruptive artifacts in synthetic speech proceeded throughout the year. Apart from isolated problems which were easily remedied, auditory analyses uncovered problems in the transcription of some less frequent loan words, especially in their derived forms (e.g., the insertion of a glottal stop in place of a hiatus connection in words like archaický). Most attention was dedicated to the way allophonic variants are selected for synthesis: this concerned 1) the voicing distinction in assimilatory contexts (the voiceless allophone of the Czech fricative trill /ř/, the voiced allophone of the velar fricative /x/), 2) the alveolar vs. velar nasal in contexts of place assimilation, and 3) the plain vs. syllabic r and l sounds. It was found that the selection of an "incorrect" allophone causes frequent problems in some source speakers (see RA1c).

RA1b

In our TTS system, some of the rare units are pooled together with more frequent similar allophones, in the expectation that unnatural artifacts would occur if the algorithm did not have enough candidates for the selection. However, this pooling introduces another kind of intrusive effect, where an “incorrect” allophonic variant is chosen without the algorithm being aware of it. Determining which of the units should not be pooled (and for which speaker) was carried out in RA1c.

RA1c

The detailed auditory and acoustic analyses concerned the effect of selecting an “incorrect” allophonic variant for synthesis (see RA1a) in five source voices. The degree of the intrusive effect is partly speaker-specific but not negligible; the analyses led to recommendations as to which allophones of one phoneme should be kept apart in the database and which can be pooled. In addition, the effect of prosodic position on the acoustic qualities of fricatives was examined with respect to suitability for synthetic speech.

RA1d

Automatic cleaning of speech corpora focused on fixing the prosodic and word-level textual annotations of the source speech recordings. As for prosodic annotation (of the Czech language), the prosodic structure given by the written sentence and the structure of the spoken utterance do not always agree: a different prosodeme type or unexpected pause placements and phrase/clause borders can be present. In such cases, the initial prosodic annotation is incorrect. We have proposed several iteratively working algorithms for the automatic correction of prosodic annotations. They were tested on 4 large speech corpora and successfully verified by listening tests. As for word-level annotation correction, a concept of a voting detector – a combination of several anomaly detectors in which each "single" detector votes on whether or not a test word is annotated correctly – has been proposed to reveal annotation errors in speech corpora. The influence of the number of anomalous and normal examples on the detection accuracy was also investigated.
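The voting-detector concept can be sketched as a majority vote over independent anomaly detectors. The individual detectors below are trivial threshold checks on invented feature names; in the actual system each would be a trained anomaly detector:

```python
# Sketch: a word is flagged as misannotated when a majority of the single
# detectors vote "anomalous". Feature names and thresholds are illustrative.

def voting_detector(detectors, word_features, min_votes=None):
    """Each detector maps a feature dict to True (anomalous) or False."""
    votes = sum(1 for d in detectors if d(word_features))
    needed = min_votes if min_votes is not None else len(detectors) // 2 + 1
    return votes >= needed

detectors = [
    lambda f: f["align_score"] < 0.5,      # poor forced-alignment score
    lambda f: f["duration_z"] > 3.0,       # implausible word duration
    lambda f: f["spectral_dist"] > 2.5,    # large spectral mismatch
]

word = {"align_score": 0.3, "duration_z": 3.5, "spectral_dist": 1.0}
print(voting_detector(detectors, word))  # 2 of 3 votes -> flagged
```

Combining detectors this way makes the decision more robust than any single detector, which is the motivation stated above.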

RA3

RA3b

Research regarding phonetically justified parameters and their usage in speech synthesis concerned mainly the acquisition of articulation data. An electromagnetic articulograph (EMA) was used for the acquisition. First, experiments with sensor placement and fixing were conducted, and articulation data were recorded along with speech. Then, the articulation and speech recordings were pre-processed so that they could be integrated into the speech synthesis process.

RA4

RA4a

Inspired by the anomaly detection framework used to detect annotation errors in RA1d, we experimented with a similar framework (called one-class classification in this context) to predict potential artifacts in the to-be-generated speech. To automate the prediction of join smoothness, with the advantage of automatic per-speaker tuning, we employed a one-class classification approach, with the classifiers trained on natural (and thus smooth) unit transitions from the source speech corpus. We focused on vowels due to their signal stability at the point of concatenation. We have also started experiments with the automatic (GMM-based) evaluation of synthetic speech. Both natural and synthetic (and also voice-converted) speech of different speakers (of different ages and genders, speaking Czech and Slovak) was examined for this purpose.

Plan for 2017

In 2017, the work on automatic cleaning of speech corpora (RA1d) will continue, especially the part concerning prosodic annotation correction. We will also continuously monitor, analyze, and catalogue synthetic speech artifacts (RA1a).

The focus of our work within RA2 will consist in designing and administering perceptual tests, a task which will also continue in the final year of the project. We will concentrate on aspects identified as intrusive. First of all, we will examine the effect of selected coarticulatory details (e.g., placing a consonant-vowel diphone drawn from a palatalized context into a non-palatal word). As a result of these research activities, we will strive to compile a matrix penalizing specific contexts in unit selection; this corresponds to the task defined in RA3a. Next, we will attempt to formulate recommendations concerning the smooth concatenation with respect to the fundamental frequency (F0) contour (RA2b). Finally, we will address spectral mismatches which seem to go undetected using MFCC metrics but are visible in the spectrogram (RA2c); we will try to identify contexts in which such mismatches most impair perception and we will seek ways of objectifying such mismatches.
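The penalization-matrix idea described above could be sketched as a lookup that adds an extra cost whenever a candidate unit is drawn from a mismatching coarticulatory context. All contexts, penalty values, and the default below are illustrative assumptions, not the matrix actually compiled in the project:

```python
# Sketch: an asymmetric penalty matrix keyed by (source context, target
# context); unlisted mismatches get a small default penalty.

PENALTY = {
    ("palatal", "non-palatal"): 5.0,   # e.g. palatalized diphone in a plain word
    ("labial", "non-labial"): 3.0,
}

def context_penalty(source_context: str, target_context: str) -> float:
    """Extra unit-selection cost for placing a unit in a foreign context."""
    if source_context == target_context:
        return 0.0
    return PENALTY.get((source_context, target_context), 1.0)
```

Entries whose mismatches were perceptually intrusive in the RA2 tests would receive large penalties, while perceptually harmless mismatches could keep the small default.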

In RA3, based on our findings in RA2, we plan to experimentally verify the concept of a penalization matrix that would drive unit selection, modeling, and concatenation using phonetically motivated aspects (RA3a). We will also continue our research into phonetically justified parameters (like formants, spectral tilt, etc.; see RA2c). Among other things, we plan to incorporate EMA-based articulatory data into the speech synthesis algorithms (RA3b). Based on the results from RA2b, we will also start work on incorporating smooth, phonetically motivated prosodic patterns (especially the F0 contour).

The focus of our work within RA4 will consist of designing a framework for hybrid speech synthesis, incorporating both signal- and model-based approaches (RA4c, RA4d), which has the potential to overcome problems existing in both approaches when they are employed separately. The framework will include the possibility to predict artifacts in synthetic speech (RA4a) and to modify the signal of synthetic speech (RA4b).

The publication plan for 2017 is slightly modified with respect to the original plan, for three reasons. First, part of the work in the first year (RA1a) consisted largely of mapping and identifying problems in the current speech synthesis system, a task which does not lend itself to publications. More importantly, the work on perceptual testing (RA2) is only just beginning; we expect the first results to be submitted, but it would be unrealistic to see them published in the same year. Finally, more studies were published in 2016, the first year of the project; due to the character of the results, they were all type D outputs. The modified plan for 2017 is thus as follows:

1-2 results of type Jimp, 1 result of type Jneimp or Jrec, and at least 4 results of type D.