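I have recently started getting into speech recognition and came across a fairly good survey paper, An Overview of End-to-End Automatic Speech Recognition. Given my limited expertise, a full translation would probably contain quite a few mistakes, so here I pick out the key points of the paper instead, in the hope of reducing the reading load and the difficulty for anyone working through it.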
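The paper is organized into the following parts: abstract, introduction, background, CTC-based end-to-end speech recognition models, RNN-Transducer-based end-to-end speech recognition models, attention-based end-to-end speech recognition models, comparison and conclusions, and finally future work and references. Below I highlight the key points part by part, in that order.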
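This part briefly covers the history of ASR, the shortcomings of HMM-DNN, and the reasons for introducing end-to-end models, and then lays out the structure of the paper.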
For a long time, the hidden Markov model (HMM)-Gaussian mixed model (GMM) has been the mainstream speech recognition framework. But recently, HMM-deep neural network (DNN) model and the end-to-end model using deep learning has achieved performance beyond HMM-GMM. - The rise of DNNs in speech recognition.
However, the HMM-DNN model itself is limited by various unfavorable factors such as data forced segmentation alignment, independent hypothesis, and multi-module individual training inherited from HMM, while the end-to-end model has a simplified model, joint training, direct output, no need to force data alignment and other advantages. - A brief comparison of HMM-DNN and E2E.
This part introduces the basic formulas and some basic concepts of the language model.
In a large vocabulary continuous speech recognition task, the hidden Markov model (HMM)-based model has always been mainstream technology, and has been widely used. Even today, the best speech recognition performance still comes from HMM-based model (in combination with deep learning techniques). Most industrially deployed systems are based on HMM. - The important position of HMM.
It replaces engineering process with learning process and needs no domain expertise, so end-to-end model is simpler for constructing and training. These advantages make the end-to-end model quickly become a hot research direction in large vocabulary continuous speech recognition (LVCSR). - The advantages and growth potential of end-to-end models.
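The first part here covers the history of ASR: from the first speech recognizer, to the first real speech recognition system out of Bell Labs, to SPHINX, and on to the latest advances.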
the SPHINX system [13] developed by Kai-Fu Lee of Carnegie Mellon University, which uses HMM to model the speech state over time and uses GMM to model HMM states’ observation probability, made a breakthrough in LVCSR and is considered a milestone in the history of speech recognition. - The landmark significance of Kai-Fu Lee's SPHINX.
In 2011, Yu Dong, Deng Li, etc. from Microsoft Research Institute proposed a hidden Markov model combined with context-based deep neural network which named context-dependent (CD)-DNN-HMM [16]. It achieved significant performance gains compared to traditional HMM-GMM system in LVCSR task. Since then, LVCSR technology using deep learning has begun to be widely studied. - Microsoft's introduction of deep models made a major contribution to speech recognition.
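The second part briefly introduces the two families of models that have contributed greatly to LVCSR: HMM-based models and end-to-end models.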
Based on the differences in their basic ideas and key technologies, LVCSR can be divided into two categories: HMM-based model and the end-to-end model. - LVCSR can be divided into two categories: HMM-based and end-to-end.
A brief introduction to HMM-based models.
In general, the HMM-based model can be divided into three parts, each of which is independent of each other and plays a different role: acoustic, pronunciation and language model. - HMM-based models consist of three parts.
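To make the three-part split concrete, here is the standard way the decoding objective factorizes; treat it as a sketch rather than the paper's exact notation. The symbols are my own assumptions: X is the sequence of acoustic feature frames, W a candidate word sequence, and S a hidden HMM state sequence.

```latex
\begin{aligned}
W^{*} &= \arg\max_{W} P(W \mid X) \\
      &= \arg\max_{W} P(X \mid W)\, P(W) \\
      &= \arg\max_{W} \sum_{S} P(X \mid S, W)\, P(S \mid W)\, P(W) \\
      &\approx \arg\max_{W} \sum_{S}
         \underbrace{P(X \mid S)}_{\text{acoustic model}}\,
         \underbrace{P(S \mid W)}_{\text{pronunciation model}}\,
         \underbrace{P(W)}_{\text{language model}}
\end{aligned}
```

The second line uses Bayes' rule and drops P(X), which does not depend on W. The third line marginalizes over the state sequence S. The last step assumes that, given S, the acoustics X no longer depend on W. Each of the three factors is trained as a separate module, which is why the quote above calls the parts independent of each other.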
The construction process and working mode of the HMM-based model determines if it faces the following difficulties in practical use: