End-to-End Attention-based Large Vocabulary Speech Recognition

Abstract: Many of the current state-of-the-art Large Vocabulary Continuous Speech Recognition Systems (LVCSR) are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with the acoustic modelling, language modelling and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN) that performs sequence prediction directly at the character level. Alignment between the input features and the desired character sequence is learned automatically by an attention mechanism built into the RNN. For each predicted character, the attention mechanism scans the input sequence and chooses relevant frames. We propose two methods to speed up this operation: limiting the scan to a subset of most promising frames and pooling over time the information contained in neighboring frames, thereby reducing source sequence length. Integrating an n-gram language model into the decoding process yields recognition accuracies similar to other HMM-free RNN-based approaches.

Synopsis

Overview

  • Keywords: Speech Recognition, Attention Mechanism, Recurrent Neural Networks, Large Vocabulary Continuous Speech Recognition, End-to-End Systems
  • Objective: Investigate an end-to-end speech recognition system using an attention-based recurrent neural network to improve large vocabulary continuous speech recognition.
  • Hypothesis: An attention-based recurrent sequence generator can effectively replace traditional HMMs in large vocabulary continuous speech recognition systems.
  • Innovation: Introduction of an Attention-based Recurrent Sequence Generator (ARSG) that allows for direct sequence prediction and alignment learning without the need for Hidden Markov Models.

Background

  • Preliminary Theories:

    • Hidden Markov Models (HMMs): Statistical models of systems that transition between hidden states over time; the standard sequence model in traditional speech recognition pipelines.
    • Recurrent Neural Networks (RNNs): Neural networks that process sequential data by maintaining a hidden state that summarizes previous inputs.
    • Attention Mechanism: A technique that lets a model focus on the relevant parts of the input sequence at each prediction step, improving alignment and context use (sketched in code at the end of this section).
    • Connectionist Temporal Classification (CTC): A loss function for sequence transduction that sums over all possible input-output alignments, enabling training without frame-level alignment information.
  • Prior Research:

    • 2012: Deep neural networks introduced for acoustic modelling in speech recognition, substantially improving over Gaussian-mixture-based systems.
    • 2014: CTC applied to end-to-end speech recognition (Graves & Jaitly), achieving competitive results on standard benchmarks; CTC itself dates back to 2006.
    • 2015: Attention-based encoder-decoder models shown effective for machine translation and phoneme recognition, demonstrating learned alignment between input and output sequences.
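
The content-based scoring at the heart of the attention mechanism fits in a few lines. Below is a minimal numpy sketch, not the paper's exact implementation: names, shapes, and the random parameters are illustrative, and the paper's hybrid mechanism additionally conditions on the previous alignment through convolutional "location" features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(state, frames, W_s, W_f, w):
    """Content-based attention: score every input frame against the
    decoder state, then summarize the frames under those weights.

    state:  (d_s,)   current decoder state
    frames: (T, d_f) encoded input frames
    """
    # One scalar energy per frame: e_t = w^T tanh(W_s s + W_f h_t)
    energies = np.tanh(state @ W_s + frames @ W_f) @ w
    alpha = softmax(energies)       # alignment weights over the T frames
    context = alpha @ frames        # weighted sum: the "glimpse" fed back in
    return context, alpha

# Toy usage with random parameters.
rng = np.random.default_rng(0)
T, d_f, d_s, d_a = 50, 16, 8, 12
context, alpha = attend(rng.normal(size=d_s), rng.normal(size=(T, d_f)),
                        rng.normal(size=(d_s, d_a)), rng.normal(size=(d_f, d_a)),
                        rng.normal(size=d_a))
print(context.shape, alpha.shape)  # (16,) (50,)
```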

Methodology

  • Key Ideas:

    • Attention-based Recurrent Sequence Generator (ARSG): Combines an RNN decoder with an attention mechanism that learns the alignment between input speech frames and output character sequences; one decode step is sketched after this list.
    • Windowing Technique: Limits the attention scan to a window of frames around the previous alignment, cutting decoding cost from roughly quadratic in utterance length (every output character scores every input frame) to linear.
    • Pooling Mechanism: Shortens the encoded sequence by pooling neighboring frames over time between encoder layers, making long utterances cheaper to process; both speed-ups appear in the sketch below.
  • Experiments:

    • Datasets: Trained and evaluated on the Wall Street Journal (WSJ) corpus, with characters as the output units.
    • Metrics: Performance measured by Character Error Rate (CER) and Word Error Rate (WER); a minimal implementation of both follows this list.
    • Integration with Language Models: Combined the character-level ARSG with n-gram language models using the Weighted Finite State Transducer (WFST) framework (a simplified scoring sketch follows this list).
  • Implications: The design yields a simpler architecture than hybrid HMM-DNN systems, enabling end-to-end training and removing the need for intermediate alignments and separately trained components.
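
As referenced in the list above, here is a hedged numpy sketch of one ARSG decode step with both speed-ups applied: frames are mean-pooled first, and attention energies are masked to a window around the previous alignment. A plain tanh recurrence stands in for the paper's GRU, the window is centered on the previous alignment's arg-max rather than the paper's exact recipe, and all shapes and parameter names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pool_frames(h, k=2):
    """Mean-pool k neighboring frames, shortening the sequence k-fold.
    (The paper pools between layers of its deep BiRNN encoder; pooling
    one feature matrix keeps the sketch self-contained.)"""
    T, d = h.shape
    return h[: (T // k) * k].reshape(T // k, k, d).mean(axis=1)

def arsg_step(s_prev, y_prev_emb, h, prev_alpha, params, window=None):
    """One output step: attend, update the decoder state, predict a character.

    s_prev:     (d_s,)   previous decoder state
    y_prev_emb: (d_e,)   embedding of the previously emitted character
    h:          (T, d_f) (pooled) encoder states
    prev_alpha: (T,)     previous alignment, used to place the window
    """
    W_s, W_f, w, W_rec, W_in, W_ctx, W_out = params
    # Content-based energies, as in the Background sketch.
    e = np.tanh(s_prev @ W_s + h @ W_f) @ w
    if window is not None:
        # Windowing: only frames near the previous alignment keep their
        # scores; the rest get -inf, i.e. exactly zero attention weight.
        c0 = int(prev_alpha.argmax())      # arg-max center for brevity
        masked = np.full_like(e, -np.inf)
        lo, hi = max(0, c0 - window), min(len(e), c0 + window + 1)
        masked[lo:hi] = e[lo:hi]
        e = masked
    alpha = softmax(e)
    ctx = alpha @ h                        # context "glimpse"
    # Plain tanh recurrence; the paper uses a GRU here.
    s = np.tanh(s_prev @ W_rec + y_prev_emb @ W_in + ctx @ W_ctx)
    p_y = softmax(s @ W_out)               # distribution over characters
    return s, p_y, alpha

# Toy usage with random parameters.
rng = np.random.default_rng(0)
T, d_f, d_s, d_e, d_a, V = 100, 16, 8, 6, 12, 32
h = pool_frames(rng.normal(size=(T, d_f)), k=2)        # 100 frames -> 50
params = (rng.normal(size=(d_s, d_a)), rng.normal(size=(d_f, d_a)),
          rng.normal(size=d_a), rng.normal(size=(d_s, d_s)),
          rng.normal(size=(d_e, d_s)), rng.normal(size=(d_f, d_s)),
          rng.normal(size=(d_s, V)))
s, p_y, alpha = arsg_step(np.zeros(d_s), np.zeros(d_e), h,
                          np.ones(len(h)) / len(h), params, window=10)
print(alpha.nonzero()[0].min(), alpha.nonzero()[0].max())  # 0 10
```

With a fixed window, the per-character cost no longer grows with the full utterance length, which is what turns overall decoding cost from roughly quadratic to linear.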
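The two reported metrics both reduce to Levenshtein edit distance, computed over characters for CER and over words for WER. A minimal, self-contained implementation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with a single rolling row."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1,         # deletion
                                       row[j - 1] + 1,     # insertion
                                       prev + (r != h))    # substitution
    return row[-1]

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(wer("the cat sat", "the cat sit"))  # 0.333...
print(cer("the cat sat", "the cat sit"))  # 0.0909...
```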
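The WFST framework composes the character-level model's hypotheses with lexicon and grammar transducers during beam search. A much-simplified stand-in for the net effect is a log-linear combination of model and language-model scores; beta, gamma, and the function name below are illustrative assumptions, not the paper's exact recipe.

```python
import math

def combined_score(log_p_model, log_p_lm, n_chars, beta=1.0, gamma=0.1):
    """score(y) = log p_ARSG(y|x) + beta * log p_LM(y) + gamma * |y|

    beta weighs the n-gram LM against the character model; the length
    bonus gamma * |y| is a common beam-search heuristic against the
    bias toward short transcripts.  Both would be tuned on held-out data.
    """
    return log_p_model + beta * log_p_lm + gamma * n_chars

# A hypothesis the acoustic model likes slightly less can still win
# once the language model weighs in:
print(combined_score(math.log(0.02), math.log(1e-3), 12) >
      combined_score(math.log(0.05), math.log(1e-6), 12))  # True
```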

Findings

  • Outcomes:

    • Without any external language model, the ARSG model outperformed comparable CTC-based systems.
    • Adding an external n-gram language model improved performance across the board, with the largest gains for CTC-based systems.
    • The attention model learned an implicit character-level language model from the transcripts, though the limited WSJ training data made overfitting a concern.
  • Significance: The research presents a viable alternative to traditional HMM-based systems, simplifying the architecture and training process while maintaining competitive accuracy.

  • Future Work: Suggested avenues include exploring joint training with larger language models and further optimizing the attention mechanism for improved scalability.

  • Potential Impact: Advancements in this area could lead to more efficient and accurate speech recognition systems, particularly in applications requiring real-time processing and large vocabulary support.

Meta

Published: 2015-08-18

Updated: 2025-08-27

URL: https://arxiv.org/abs/1508.04395v2

Authors: Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, Yoshua Bengio

Citations: 1098

H Index: 285

Categories: cs.CL, cs.AI, cs.LG, cs.NE

Model: gpt-4o-mini