Sparks of Artificial General Intelligence: Early experiments with GPT-4

Abstract: Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

Synopsis

Overview

  • Keywords: Artificial General Intelligence, GPT-4, Language Models, Machine Learning, Cognitive Abilities
  • Objective: Investigate the capabilities of an early version of GPT-4 and assess its potential as a form of artificial general intelligence (AGI).
  • Hypothesis: GPT-4 exhibits characteristics of general intelligence, outperforming previous models across various tasks and domains.
  • Innovation: An evaluation approach that moves beyond standard benchmarks, probing human-like reasoning and interdisciplinary capabilities with novel, hand-crafted tasks.

Background

  • Preliminary Theories:

    • Transformers: A neural network architecture built on self-attention that has revolutionized natural language processing by letting models weigh relationships between all positions in a sequence.
    • Self-Supervised Learning: A training paradigm where models learn from unlabeled data by predicting parts of the input, crucial for training large language models.
    • General Intelligence: The ability to learn, understand, and apply knowledge across a wide range of tasks, a key goal in AI research.
    • Emergent Behaviors: Phenomena where complex capabilities arise from simple rules or structures, often observed in large-scale models.
  • Prior Research:

    • 2017: Introduction of the Transformer architecture, significantly improving NLP tasks.
    • 2020: Release of GPT-3, showcasing advanced language generation capabilities but limited in reasoning and understanding.
    • 2022: Development of models like PaLM, which further pushed the boundaries of language understanding and task performance.
    • 2022: Initial assessments of GPT-4's capabilities, indicating a shift towards more general intelligence in AI systems.
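The transformer and self-supervised-learning ideas above can be sketched together in a few lines: a toy single-head causal self-attention layer whose output is scored with a next-token cross-entropy loss. This is an illustrative NumPy sketch with random, untrained weights and made-up dimensions, not the architecture or code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x):
    """Single-head self-attention with a causal mask, so each
    position attends only to itself and earlier positions."""
    T, d = x.shape
    # Toy projections: in a real transformer these are learned weights.
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores[mask] = -np.inf                            # cannot peek ahead
    return softmax(scores) @ v

def next_token_loss(logits, targets):
    """Self-supervised objective: cross-entropy of predicting each
    next token from the unlabeled sequence itself."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# Tiny demo: 5 tokens, 8-dim embeddings, vocabulary of 10.
T, d, V = 5, 8, 10
x = rng.normal(size=(T, d))              # token embeddings
h = causal_self_attention(x)
logits = h @ rng.normal(size=(d, V))     # project to vocabulary
targets = rng.integers(0, V, size=T)     # "next tokens" drawn from the data
print(next_token_loss(logits, targets))
```

The causal mask is what makes next-word prediction self-supervised: the training signal comes entirely from the raw text, with no human labels.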

Methodology

  • Key Ideas:

    • Interdisciplinary Tasks: Evaluating GPT-4 across diverse domains such as mathematics, coding, and the arts to assess its integrative capabilities.
    • Human-like Reasoning: Designing tasks that require not just rote memorization but creative problem-solving and reasoning.
    • Comparative Analysis: Benchmarking GPT-4 against previous models like ChatGPT to highlight advancements in performance.
  • Experiments:

    • Task Performance: Conducting a series of tests across various domains, including coding challenges, mathematical problem-solving, and creative writing.
    • Human Interaction: Engaging with GPT-4 through natural language prompts to evaluate its understanding and generation capabilities.
    • Limitations Exploration: Identifying areas where GPT-4 struggles, such as planning and long-term memory.
  • Implications: The methodology emphasizes a shift from traditional benchmarks to more holistic evaluations of AI capabilities, focusing on human-like intelligence.
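The comparative-analysis step above can be sketched as a small benchmarking harness. Everything here is hypothetical: the task list, the string-matching checkers, and the stub model functions stand in for real API calls and the human grading the paper actually relies on.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical task suite: (prompt, checker) pairs spanning domains.
TASKS: List[Tuple[str, Callable[[str], bool]]] = [
    ("What is 17 * 23?", lambda ans: "391" in ans),
    ("Write Python that reverses a string s.", lambda ans: "[::-1]" in ans),
]

def evaluate(model: Callable[[str], str]) -> float:
    """Run every task through a model and return the fraction passed."""
    passed = sum(checker(model(prompt)) for prompt, checker in TASKS)
    return passed / len(TASKS)

def compare(models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Benchmark several models on the same task suite side by side."""
    return {name: evaluate(fn) for name, fn in models.items()}

# Stub "models" standing in for calls to different LLMs.
stronger = lambda p: "391" if "17" in p else "s[::-1]"
weaker = lambda p: "I am not sure."
print(compare({"stronger": stronger, "weaker": weaker}))
# → {'stronger': 1.0, 'weaker': 0.0}
```

Automated checkers like these only approximate the paper's methodology; open-ended tasks such as creative writing require qualitative, human judgment rather than pass/fail scoring.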

Findings

  • Outcomes:

    • GPT-4 demonstrates human-level performance in numerous tasks, including complex coding and creative writing.
    • The model shows significant improvements in commonsense reasoning compared to earlier versions.
    • Limitations remain, particularly in areas requiring long-term planning and memory.
  • Significance: GPT-4's capabilities challenge previous assumptions about AI, suggesting a paradigm shift towards models that exhibit general intelligence traits.

  • Future Work: Suggested areas for further research include improving the model's ability to learn from experience, enhancing its memory capabilities, and refining its understanding of human emotions and intentions.

  • Potential Impact: Advancements in these areas could lead to more robust AI systems capable of functioning in complex, real-world environments, significantly influencing fields such as healthcare, education, and creative industries.

Meta

Published: 2023-03-22

Updated: 2025-08-27

URL: https://arxiv.org/abs/2303.12712v5

Authors: Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang

Citations: 1949

H Index: 280

Categories: cs.CL, cs.AI

Model: gpt-4o-mini