Distributed Representations of Words and Phrases and their Compositionality

Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

Synopsis

Overview

  • Keywords: Word embeddings, Skip-gram model, Negative sampling, Compositionality, Natural language processing
  • Objective: To present an efficient method for learning high-quality distributed vector representations of words and phrases, enhancing their compositionality.
  • Hypothesis: The research hypothesizes that the Skip-gram model can be extended to effectively learn representations for phrases and that these representations exhibit meaningful linear relationships.

Background

  • Preliminary Theories:

    • Word Embeddings: Representations of words in a continuous vector space, allowing for capturing semantic relationships.
    • Skip-gram Model: A neural network architecture that predicts the surrounding words given a target word, enabling efficient learning of word vectors (a minimal training-pair sketch follows this list).
    • Compositionality: The principle that the meaning of a phrase can be derived from the meanings of its individual words; the authors note that idiomatic phrases such as "Air Canada" are not compositional in this sense, motivating phrase-level tokens.
    • Negative Sampling: A technique to simplify the training of neural networks by focusing on distinguishing true data from noise, improving training efficiency.
  • Prior Research:

    • 1986: Early work on neural networks for language modeling by Rumelhart, Hinton, and Williams.
    • 2012: Noise Contrastive Estimation (NCE) by Gutmann and Hyvärinen, providing a framework for efficiently training unnormalized probabilistic models by discriminating data from noise.
    • 2013: Introduction of the original Skip-gram model by Mikolov et al., which demonstrated significant improvements in learning word representations.
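
As a rough illustration of the Skip-gram setup described above (not the authors' implementation; the toy corpus, tokenization, and window size are arbitrary assumptions), training reduces to predicting each word's neighbors, i.e., generating (target, context) pairs within a sliding window:

```python
# Minimal sketch: generating (target, context) training pairs for Skip-gram.
# Corpus, tokenization, and window size are illustrative assumptions.

def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within a symmetric window around each word."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

corpus = "the quick brown fox jumps over the lazy dog".split()
for target, context in skipgram_pairs(corpus, window=2):
    print(target, "->", context)
```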

Methodology

  • Key Ideas:

    • Subsampling of Frequent Words: Each occurrence of a very frequent word (e.g., "the", "in") is discarded with a probability that grows with the word's frequency, which speeds up training and improves the representations of rarer words (sketched after this list).
    • Negative Sampling: A simplified variant of NCE that trains logistic regression to distinguish observed (word, context) pairs from a handful of randomly drawn noise words, avoiding the full softmax (sketched after this list).
    • Phrase Identification: A data-driven pass that scores bigrams by how much more often they co-occur than chance and merges high-scoring pairs into single tokens such as "air_canada", enhancing the model's expressiveness (sketched after this list).
  • Experiments:

    • Ablation Studies: Evaluated the impact of subsampling and of different training objectives (Negative Sampling, Hierarchical Softmax, NCE) on the quality of the learned representations.
    • Phrase Analogy Tasks: Developed a test set for evaluating the quality of phrase vectors through analogical reasoning tasks, such as “Montreal”:“Montreal Canadiens”::“Toronto”:“Toronto Maple Leafs”.
  • Implications: The methodology allows for the efficient training of models on large datasets, improving the representation of both common and rare words and phrases.
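
A minimal sketch of the subsampling heuristic from Key Ideas, using the paper's discard probability P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (around 1e-5 in the paper); this is an illustrative reimplementation, not the released word2vec code:

```python
import math
import random
from collections import Counter

def keep_probability(freq, t=1e-5):
    """Keep probability for a word with relative frequency `freq`.
    The paper discards each occurrence with P(w) = 1 - sqrt(t / f(w))."""
    return 1.0 if freq <= 0 else min(1.0, math.sqrt(t / freq))

def subsample(tokens, t=1e-5):
    """Randomly drop occurrences of frequent words before training."""
    counts = Counter(tokens)
    total = len(tokens)
    return [w for w in tokens
            if random.random() < keep_probability(counts[w] / total, t)]
```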
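A sketch of the negative-sampling objective for a single (target, context) pair: log σ(v'_c · v_t) plus Σ log σ(-v'_n · v_t) over k noise words n drawn from a unigram-based distribution. The dimensionality, number of negatives, and random vectors below are illustrative assumptions, not the paper's C implementation:

```python
import numpy as np

def negative_sampling_loss(v_target, v_context, v_negatives):
    """Negative-sampling loss for one (target, context) pair.

    v_target:    input embedding of the target word, shape (d,)
    v_context:   output embedding of the observed context word, shape (d,)
    v_negatives: output embeddings of k sampled noise words, shape (k, d)
    Returns the negated log-likelihood to minimize.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    positive = np.log(sigmoid(v_context @ v_target))
    negative = np.sum(np.log(sigmoid(-v_negatives @ v_target)))
    return -(positive + negative)

rng = np.random.default_rng(0)
d, k = 100, 5  # illustrative dimensionality and number of negatives
loss = negative_sampling_loss(rng.normal(size=d),
                              rng.normal(size=d),
                              rng.normal(size=(k, d)))
```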
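Finally, a sketch of the data-driven phrase identification: bigrams are scored as score(a, b) = (count(a b) - δ) / (count(a) × count(b)) and pairs scoring above a threshold are merged into single tokens. The δ and threshold values are illustrative; the paper runs several merging passes with a decreasing threshold:

```python
from collections import Counter

def phrase_scores(tokens, delta=5):
    """Score adjacent pairs: (count(ab) - delta) / (count(a) * count(b))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(a, b): (count - delta) / (unigrams[a] * unigrams[b])
            for (a, b), count in bigrams.items()}

def merge_phrases(tokens, threshold=0.0, delta=5):
    """Replace high-scoring bigrams with single tokens like 'air_canada'."""
    scores = phrase_scores(tokens, delta)
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, float("-inf")) > threshold:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```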

Findings

  • Outcomes:

    • The Skip-gram model with subsampling and Negative Sampling significantly outperformed previous models in terms of training speed and representation quality.
    • The learned word and phrase vectors exhibited linear structures that support analogical reasoning through simple vector arithmetic, demonstrating a degree of compositionality (illustrated after this list).
    • The best model reached 72% accuracy on the phrase analogy dataset, indicating effective learning of phrase representations.
  • Significance: This research challenges previous beliefs about the limitations of word representations, showing that phrases can be effectively modeled and that simple vector arithmetic can yield meaningful results.

  • Future Work: Suggested exploration of more complex phrase structures and the integration of additional linguistic features into the model.

  • Potential Impact: Advancements in natural language processing tasks, such as machine translation and sentiment analysis, through improved understanding and representation of language.
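
The analogical reasoning referenced under Outcomes reduces to vector arithmetic plus a nearest-neighbor search: for "Montreal" : "Montreal Canadiens" :: "Toronto" : ?, compute vec("Montreal Canadiens") - vec("Montreal") + vec("Toronto") and return the closest vector by cosine similarity. A minimal sketch, assuming a hypothetical `vectors` dict mapping tokens (including merged phrase tokens) to numpy arrays:

```python
import numpy as np

def analogy(embeddings, a, a_star, b, topn=1):
    """Solve a : a_star :: b : ? by cosine similarity to (a_star - a + b)."""
    query = embeddings[a_star] - embeddings[a] + embeddings[b]
    query /= np.linalg.norm(query)
    scores = {word: float(vec @ query / np.linalg.norm(vec))
              for word, vec in embeddings.items()
              if word not in (a, a_star, b)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Usage (illustrative; `vectors` is assumed to hold trained embeddings):
# analogy(vectors, "montreal", "montreal_canadiens", "toronto")
# -> ["toronto_maple_leafs"] if the phrase vectors capture the relationship
```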

Notes

Meta

Published: 2013-10-16

Updated: 2025-08-27

URL: https://arxiv.org/abs/1310.4546v1

Authors: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean

Citations: 31759

H Index: 168

Categories: cs.CL, cs.LG, stat.ML

Model: gpt-4o-mini