A Recent Survey of Speech Recognition

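I have recently started learning speech recognition and came across a good survey paper, "An Overview of End-to-End Automatic Speech Recognition". My technical background is limited, so a full translation would probably contain quite a few errors. Instead, I pick out the key points of the paper here, hoping to reduce both the amount of reading and the difficulty of understanding for anyone who reads it.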

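The paper is organized into the following parts: abstract, introduction, background, CTC-based end-to-end speech recognition models, RNN-Transducer-based end-to-end speech recognition models, attention-based end-to-end speech recognition models, comparison and conclusions, and finally future work and references. I highlight the key points below in the same order.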

Abstract

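The abstract briefly covers the history of ASR, the shortcomings of HMM-DNN, and the reasons for introducing end-to-end models, and then outlines the structure of the paper.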

For a long time, the hidden Markov model (HMM)-Gaussian mixed model (GMM) has been the mainstream speech recognition framework. But recently, HMM-deep neural network (DNN) model and the end-to-end model using deep learning has achieved performance beyond HMM-GMM. - The rise of DNNs in speech recognition.

However, the HMM-DNN model itself is limited by various unfavorable factors such as data forced segmentation alignment, independent hypothesis, and multi-module individual training inherited from HMM, while the end-to-end model has a simplified model, joint training, direct output, no need to force data alignment and other advantages. - A brief comparison of HMM-DNN and E2E.

Introduction

The introduction presents the basic formula of the language model and some basic concepts.

In a large vocabulary continuous speech recognition task, the hidden Markov model (HMM)-based model has always been mainstream technology, and has been widely used. Even today, the best speech recognition performance still comes from HMM-based model (in combination with deep learning techniques). Most industrially deployed systems are based on HMM. - The important position of HMM.

It replaces engineering process with learning process and needs no domain expertise, so end-to-end model is simpler for constructing and training. These advantages make the end-to-end model quickly become a hot research direction in large vocabulary continuous speech recognition (LVCSR). - The advantages and development potential of end-to-end models.

Background

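The first part of this section covers the history of ASR: from the first speech recognizer, to the first true speech recognition system out of Bell Labs, to SPHINX, and then to the latest advances.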

the SPHINX system [13] developed by Kai-Fu Lee of Carnegie Mellon University, which uses HMM to model the speech state over time and uses GMM to model HMM states’ observation probability, made a breakthrough in LVCSR and is considered a milestone in the history of speech recognition. - The epoch-making significance of Kai-Fu Lee's SPHINX.

In 2011, Yu Dong, Deng Li, etc. from Microsoft Research Institute proposed a hidden Markov model combined with context-based deep neural network which named context-dependent (CD)-DNN-HMM [16]. It achieved significant performance gains compared to traditional HMM-GMM system in LVCSR task. Since then, LVCSR technology using deep learning has begun to be widely studied. - Microsoft's introduction of deep models made a major contribution to speech recognition.

The second part briefly introduces the HMM-based models and the end-to-end models, both of which have contributed greatly to LVCSR.

Based on the differences in their basic ideas and key technologies, LVCSR can be divided into two categories: HMM-based model and the end-to-end model. - LVCSR falls into two categories: HMM-based and end-to-end.

HMM-Based Model

A brief introduction to HMM-based models.

In general, the HMM-based model can be divided into three parts, each of which is independent of each other and plays a different role: acoustic, pronunciation and language model. - The HMM-based model consists of three parts.
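
To make the three-part split concrete, here is a rough sketch of the usual probabilistic decomposition behind HMM-based recognition. The notation is my own assumption and is not quoted from the paper: X is the sequence of acoustic features, L is the word (label) sequence, and S is the HMM state sequence.

```latex
\begin{aligned}
L^{*} &= \arg\max_{L} P(L \mid X)
       = \arg\max_{L} P(X \mid L)\, P(L) \\
P(X \mid L) &= \sum_{S} P(X \mid S, L)\, P(S \mid L)
       \approx \sum_{S} P(X \mid S)\, P(S \mid L)
\end{aligned}
```

Under this reading, P(X | S) plays the role of the acoustic model, P(S | L) the pronunciation model, and P(L) the language model. The approximation in the second line assumes the acoustics depend on the word sequence only through the state sequence.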

The construction process and working mode of the HMM-based model determines if it faces the following difficulties in practical use: