Lost in the Middle: How Language Models Use Long Contexts

Abstract: While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.

Synopsis

Overview

  • Keywords: Language Models, Long Contexts, Multi-Document Question Answering, Key-Value Retrieval, Performance Analysis
  • Objective: Analyze how language models utilize long input contexts in multi-document question answering and key-value retrieval tasks.
  • Hypothesis: Language models perform better when relevant information is located at the beginning or end of the input context, with significant performance degradation when information is in the middle.
  • Innovation: Introduction of new evaluation protocols and insights into the limitations of language models regarding context utilization.

Background

  • Preliminary Theories:

    • Transformer Architecture: Underpins modern language models. Its self-attention mechanism requires memory and compute that grow quadratically with sequence length, which makes long contexts costly to process.
    • Serial-Position Effect: A psychological phenomenon in which recall is better for items at the beginning (primacy) and end (recency) of a list; the performance trends observed in language models parallel it.
    • Recency Bias: Tendency of models to favor recent tokens over distant ones, affecting their ability to leverage long contexts effectively.
    • Instruction Tuning: Supervised fine-tuning on instruction-following data, which may influence how models prioritize information in their input contexts.
  • Prior Research:

    • 2018: Khandelwal et al. found that LSTM language models make increasingly coarse use of longer-term context.
    • 2020: Petroni et al. explored the integration of retrieval systems with language models for improved question answering.
    • 2021: Sun et al. found that longer contexts improve predictions for only a small subset of tokens.
    • 2023: Qin et al. analyzed long-range transformers on downstream NLP tasks, finding that they are recency-biased and do not make effective use of long-range context.
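
To illustrate why self-attention becomes costly on long inputs, here is a minimal sketch that counts the pairwise attention scores one head computes in one layer under full (dense) self-attention. The function name and the example lengths are illustrative assumptions, not taken from the paper, and optimized implementations may avoid storing the full score matrix even though the number of pairwise interactions still grows quadratically.

```python
def attention_score_entries(seq_len: int) -> int:
    """Pairwise scores for one head in one layer of dense self-attention.

    Every token attends to every token, so the score matrix has
    seq_len * seq_len entries.
    """
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    return seq_len * seq_len


# Hypothetical context lengths, chosen only to show the growth rate.
for n in (2_000, 4_000, 16_000):
    print(f"{n:>6} tokens -> {attention_score_entries(n):>12,} scores")
```

Doubling the sequence length quadruples the count, which is the sense in which the cost is quadratic.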

Methodology

  • Key Ideas:

    • Controlled Experiments: Conducted on multi-document question answering and key-value retrieval tasks, manipulating context length and position of relevant information.
    • Performance Metrics: Accuracy, judged by whether the correct answer (for question answering) or the correct value (for key-value retrieval) appears in the model's output.
    • Model Variants: Evaluated both decoder-only and encoder-decoder architectures to assess their context utilization capabilities.
  • Experiments:

    • Multi-Document Question Answering: Models answered a question given a set of documents, exactly one of which contains the answer; performance was analyzed as a function of that document's position in the input context.
    • Key-Value Retrieval: A synthetic task in which models return the value associated with a specified key in a JSON object of key-value pairs, testing their basic ability to retrieve information from the input context.
    • Evaluation of Extended Context Models: Compared standard models with extended-context versions to determine if longer context windows improved performance.
  • Implications: Findings suggest that simply increasing context length does not equate to better performance; models exhibit a U-shaped performance curve based on the position of relevant information.
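
The key-value retrieval setup described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes random UUID strings as keys and values, and the prompt wording, the function names `build_kv_prompt` and `is_correct`, and the substring-match scoring rule are assumptions made for the example.

```python
import json
import uuid
from typing import Tuple


def build_kv_prompt(num_pairs: int, target_index: int) -> Tuple[str, str, str]:
    """Build a retrieval prompt with the queried pair at target_index.

    Returns (prompt, key, expected_value). The prompt text is illustrative.
    """
    if not 0 <= target_index < num_pairs:
        raise ValueError("target_index must be in [0, num_pairs)")
    # Random UUID strings stand in for keys and values.
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(num_pairs)]
    key, expected_value = pairs[target_index]
    # dict preserves insertion order, so the serialized JSON keeps
    # the queried pair at the chosen position.
    kv_json = json.dumps(dict(pairs), indent=1)
    prompt = (
        "Extract the value corresponding to the specified key "
        "in the JSON object below.\n\n"
        f"{kv_json}\n\n"
        f'Key: "{key}"\nCorresponding value:'
    )
    return prompt, key, expected_value


def is_correct(model_output: str, expected_value: str) -> bool:
    """Count a prediction as correct if the expected value appears in it."""
    return expected_value in model_output
```

Sweeping `target_index` from 0 to `num_pairs - 1` while holding `num_pairs` fixed varies only the position of the relevant pair; changing `num_pairs` varies the context length.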

Findings

  • Outcomes:

    • U-Shaped Performance Curve: Models performed best when relevant information was at the beginning or end of the context, with significant drops in accuracy when information was in the middle.
    • Decreased Performance with Longer Contexts: As the number of documents increased, model performance consistently declined, indicating difficulty in processing extensive information.
    • Architecture Impact: Encoder-decoder models were relatively robust to the position of relevant information, but only on sequences within their training-time length; on longer sequences they also showed a U-shaped curve.
  • Significance: Results challenge the assumption that longer context windows inherently lead to better performance, highlighting specific limitations in how models process information.

  • Future Work: Suggested exploration of improved contextualization techniques, such as query-aware methods, and further investigation into the effects of instruction tuning on performance.

  • Potential Impact: Advancements in understanding how language models utilize context could lead to the development of more effective models and evaluation protocols, enhancing applications in question answering and information retrieval.
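
The position analysis behind the U-shaped curve can be sketched as below. This is an illustrative outline under stated assumptions, not the paper's evaluation code: the helper names (`place_gold_document`, `accuracy_by_position`, `middle_gap`) and the gap statistic are hypothetical, and any numbers fed to it here would be made-up toy data rather than reported results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def place_gold_document(distractors: List[str], gold: str, position: int) -> List[str]:
    """Return the document list with the gold document at a 0-based position."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must be in [0, len(distractors)]")
    return distractors[:position] + [gold] + distractors[position:]


def accuracy_by_position(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Average correctness for each gold-document position."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for position, correct in results:
        totals[position] += 1
        hits[position] += int(correct)
    return {p: hits[p] / totals[p] for p in sorted(totals)}


def middle_gap(acc: Dict[int, float]) -> float:
    """Lower of the two edge accuracies minus the worst interior accuracy.

    A positive value means some middle position scores below both the
    first and the last position, which is consistent with a U shape.
    """
    positions = sorted(acc)
    if len(positions) < 3:
        raise ValueError("need at least three positions")
    edges = min(acc[positions[0]], acc[positions[-1]])
    interior = min(acc[p] for p in positions[1:-1])
    return edges - interior
```

In use, each evaluation example would be run once per gold-document position, the `(position, correct)` pairs collected, and `middle_gap` applied to the per-position accuracies as a rough single-number check for a dip in the middle.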

Notes

Meta

Published: 2023-07-06

Updated: 2025-10-25

URL: https://arxiv.org/abs/2307.03172v2

Authors: Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang

Citations: 579

H Index: 179

Categories: cs.CL

Model: gpt-4o-mini