Towards End-To-End Automatic Speech Recognition
Streaming automatic speech recognition (ASR) systems consist of a set of separate components, namely an acoustic model (AM), a pronunciation model (PM), a language model (LM), and an endpointer (EP). Traditionally, these components are trained independently on different datasets, with a number of independence assumptions made for tractability.
Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly in a single neural network. Given input acoustic frames, such a network directly outputs a probability distribution over grapheme or word hypotheses. End-to-end models of this kind have been shown to surpass the performance of conventional ASR systems.
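The core idea, acoustic frames in, a distribution over graphemes out, can be sketched as follows. This is purely illustrative: the feature dimension, the toy grapheme inventory, and the random linear projection (a stand-in for a trained neural network) are all assumptions, not any particular model from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_FRAMES = 5                           # acoustic frames in a toy utterance
FEATURE_DIM = 40                         # e.g. log-mel filterbank features per frame
GRAPHEMES = list("abc") + ["<blank>"]    # toy grapheme inventory (assumption)

# Input: a sequence of acoustic feature vectors, one per frame.
frames = rng.normal(size=(NUM_FRAMES, FEATURE_DIM))

# Stand-in for a trained network: a single random linear layer.
weights = rng.normal(size=(FEATURE_DIM, len(GRAPHEMES)))

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

logits = frames @ weights   # shape: (NUM_FRAMES, len(GRAPHEMES))
probs = softmax(logits)     # per-frame probability distribution over graphemes
```

In a real end-to-end system the linear layer would be replaced by a deep network (for example a recurrent or attention-based encoder), trained so that these per-frame distributions can be decoded into word hypotheses without separate AM, PM, and LM components.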
In this talk, we will present a number of recently introduced innovations that have significantly improved the performance of end-to-end models. We will also discuss some of the shortcomings and ongoing efforts to address these challenges.
Senior Research Scientist at Google Inc.
Bo Li received a Ph.D. degree in Computer Science from the School of Computing, National University of Singapore, in 2014, and a B.E. degree in Computer Engineering from the School of Computer Science, Northwestern Polytechnical University, China, in 2008.
He is currently a Senior Research Scientist at Google. His research interests are mainly in acoustic modeling for robust automatic speech recognition, including deep learning adaptation methods and lifelong learning.