Speech Production in Automatic Speech Recognition Systems

Event date: Thursday, 2 October, 2014 - 11:00
Sala Direzione Edificio Ovest Povo
Claudia Canevari, Istituto Italiano di Tecnologia (IIT), Genoa, Italy

This presentation focuses on Automatic Speech Recognition (ASR) that combines acoustic data with information about vocal tract movements during speech production. The motivation is twofold: (i) exploiting the regular and robust behaviour of the vocal tract articulators during speech production (King et al. 2007) to improve the accuracy of ASR systems in real-world situations (e.g. speech and speaker variability, noisy environments, mismatch between training and test sets) where recognition performance still falls far short of human level, and (ii) computationally supporting recent empirical neurophysiological evidence for an effective and causal contribution of the motor system in the human brain to speech perception and understanding (D’Ausilio et al. 2009).

Two main research areas in speech technology are covered: (i) speech inversion, a procedure that estimates the behaviour of the vocal tract articulators, in terms of measured articulatory features (AFs), from speech acoustics, and (ii) acoustic modelling for speech recognition.

First, the presentation describes two Deep-Neural-Network (DNN)-based speech inversion strategies that rely on multi-layered, hierarchical representations of the acoustic and articulatory domains obtained by unsupervised learning of Deep Belief Networks (DBNs). The unsupervised learning of DBNs is used to (i) pretrain the Multi-Layer Perceptrons (MLPs) that perform speech inversion and (ii) create a “deep” and “less noisy” representation of the articulatory domain that is subsequently used as the new target in speech inversion.


Second, it demonstrates a strategy for learning a DNN-based speech inversion in which the contribution of each AF to the global reconstruction error is weighted by its relevance to the production of a given sound. The relevance of an AF is computed as a function of its frame-wise variance, estimated by a Mixture Density Network (MDN) given the acoustic evidence. The aim is to improve the reconstruction of the vocal tract articulators that are most critical for producing a given sound, at the expense of the less critical ones.
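
The relevance-weighted reconstruction error described above can be illustrated with a minimal sketch. The abstract does not state which function of the variance is used, so the inverse-variance weighting below is only an assumption, and the names `relevance_weights` and `weighted_frame_error` are hypothetical. The per-AF variances stand in for the frame-wise values an MDN would predict from the acoustics.

```python
def relevance_weights(variances, eps=1e-8):
    """Turn the per-AF variances of one frame into weights that sum to 1.

    Assumption (not stated in the abstract): an AF with low predicted
    variance is treated as more critical, so relevance is taken to be
    inversely proportional to the variance.
    """
    inverse = [1.0 / (v + eps) for v in variances]
    total = sum(inverse)
    return [x / total for x in inverse]


def weighted_frame_error(target, predicted, variances):
    """Squared reconstruction error for one frame, weighted per AF."""
    weights = relevance_weights(variances)
    return sum(
        w * (t - p) ** 2
        for w, t, p in zip(weights, target, predicted)
    )
```

With variances of 1.0 and 3.0 for two AFs, the first AF receives about three quarters of the total weight, so an error on it costs roughly three times as much as the same error on the second.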

Finally, it shows the benefits of appending recovered AFs to acoustic observations in DNN-based phone recognition systems, in speaker-dependent phone recognition and speaker-independent phone classification tasks, under clean and noisy conditions.
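
Appending recovered AFs to acoustic observations amounts, in its simplest form, to frame-by-frame concatenation of the two feature vectors. The sketch below assumes both streams are already aligned at the same frame rate; the function name `append_articulatory_features` and the list-of-lists layout are hypothetical and not taken from the presentation.

```python
def append_articulatory_features(acoustic_frames, recovered_afs):
    """Concatenate each acoustic frame with the AFs recovered for that frame.

    Assumption: both inputs hold one feature vector per frame and are
    time-aligned, so frame i of one corresponds to frame i of the other.
    """
    if len(acoustic_frames) != len(recovered_afs):
        raise ValueError("acoustic and articulatory streams differ in length")
    return [
        list(acoustic) + list(afs)
        for acoustic, afs in zip(acoustic_frames, recovered_afs)
    ]
```

The resulting vectors would then serve as input to the phone recognizer in place of the purely acoustic ones.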


Diego Giuliani