Notes from the Wired

Frontier Models are Capable of In-context Scheming

December 8, 2024 | 695 words | 4min read

Paper Title: Frontier Models are Capable of In-context Scheming
Link to Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
Date: 01. December 2024
Paper Type: LLM, Alignment
Short Abstract:
Frontier language models such as Llama 3.1, Claude 3.5 Sonnet, and OpenAI’s O1 model have become increasingly popular for deploying autonomous agents. One concern with these models is that they might covertly pursue misaligned goals or conceal their true capabilities, a behavior the author refers to as scheming. The authors investigate whether frontier models exhibit these scheming behaviors.

1. Introduction

LLMs are no longer just used to generate text; they are now being trained and deployed as autonomous agents capable of independently pursuing and executing complex goals.

Previous research has demonstrated that LLMs can act deceptively when explicitly instructed to do so, and sometimes even without explicit instructions.

This work evaluates whether models can engage in in-context scheming by prompting them to pursue specific goals. The authors test this by instructing the model to follow a particular goal and placing it in an environment that incentivizes scheming.

2. Evaluation

2.1 Evaluation Design

A model is considered to be “scheming” when it covertly pursues misaligned goals or hides its true capabilities.

To evaluate this, the authors place the LLM in a generic agent framework with access to the command line. The model is assigned a task requiring information from the file system and is asked to solve it. During the task, the model discovers that the user’s or developer’s goals differ from its current goal.

In covert subversion, the model directly takes misaligned actions but does not attempt to avoid detection. In deferred subversion, the model behaves in a way that creates more favorable conditions for its misaligned goal, such as temporarily aligning with its developers to avoid detection.

Models trained to be helpful, harmless, and honest (HHH) often exhibit a disposition against deceptive or manipulative behavior. This makes evaluating scheming in such models particularly challenging.

2.2 Covert Subversion Evaluations

2.3 Deferred subversion evaluations

3. Results

They run evaluations on Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 405B Instruct, and GPT-4O.

Highlights of the Results:

4. Conclusion

Current LLMs are capable of in-context scheming, as evidenced by the results from several frontier models. This phenomenon warrants further study.