Theoretical Breakthrough: How Self-Correction Enhances OpenAI o1's Reasoning Capabilities

How Does Self-Correction Greatly Enhance OpenAI o1's Reasoning Ability? Peking University and MIT Teams Provide Theoretical Explanation

Self-correction, traditionally viewed as a uniquely human trait, is increasingly being applied in the field of artificial intelligence, particularly in Large Language Models (LLMs). Recent groundbreaking models such as OpenAI o1 [1] and Reflection 70B [2] have both incorporated self-correction methods.

Conventional LLMs generate output token by token, which can lead to errors in longer sequences. Even if the model later realizes a previous token was incorrect, it lacks a mechanism to rectify past mistakes, often resulting in further errors to maintain consistency with the initial error.

OpenAI o1's "slow thinking" process, which generates Hidden Chain of Thought (CoT), demonstrates an interesting problem-solving approach. Analyzing the Hidden CoT example from OpenAI's website reveals that when solving a cipher puzzle, o1 first recognized that every two consecutive plaintext letters map to one ciphertext letter. It initially attempted to construct the plaintext using odd-numbered letters but found this approach unreasonable (marked as "Not directly"). The model then corrected its approach, ultimately solving the puzzle successfully. ▲ Figure 1. OpenAI o1 website example (partial Hidden CoT)

Reflection 70B's key technologies also include error detection and error correction. They utilize a technique called Reflection-Tuning, which enables the model to detect and correct errors in its own reasoning before finalizing a response.

In practice, this process employs a mechanism known as a thinking tag. The model reflects within this tag until it arrives at the correct answer or believes it has reached the correct answer.

Why are self-correction techniques, frequently applied to large language models, effective? How does the correction process enable models to correctly answer questions they initially got wrong?

To investigate this question, a team from Peking University led by Yisen Wang collaborated with MIT to theoretically analyze the working mechanism behind the self-correction capabilities of large language models.

Research Team

Paper Title: A Theoretical Understanding of Self-Correction through In-context Alignment
Paper Link: https://openreview.net/pdf?id=OtvNLTWYww
Code Repository: https://github.com/yifeiwang77/Self-Correction

This groundbreaking research paper explores the theoretical foundations of self-correction in Large Language Models (LLMs) through the lens of in-context alignment. For newcomers to the field of AI and machine learning, this study offers valuable insights into how LLMs can improve their reasoning capabilities and output quality through self-correction mechanisms.

The research team abstracted the self-correction process as an alignment task, conducting a theoretical analysis of self-correction from the perspective of in-context learning.

Notably, instead of using linear regression tasks under linear attention mechanisms for theoretical analysis, they employed the softmax multi-head attention mechanism in transformer structures, which are used in real-world LLMs. They also utilized the Bradley-Terry model and Plackett-Luce model (actual choices for LLM alignment, used in RLHF and DPO) to design alignment tasks for their research.

Inspired by their theoretical findings, they proposed a simple self-correction strategy called "Check as Context". Through experiments, this strategy demonstrated significant effectiveness in eliminating potential biases in large language models and defending against jailbreak attacks.

Theoretical Analysis: Is Self-Correction a Form of In-Context Alignment?

Unlike standard in-context examples similar to supervised learning (query Q, answer A), self-correction examples can form a triplet (query Q, answer A, reward R). This resembles LLM alignment through rewards indicating good or bad samples.

Therefore, the research team proposes formalizing self-correction as "In-context Alignment." This process optimizes the LLM's final output to achieve higher rewards by providing a series of self-correction steps as context.

The alignment process typically includes: For a given question Q, collect n different model answers, then have humans or an evaluation model (in this case, the LLM itself) provide ranking preferences P for these n answers. Next, use general alignment models (such as Bradley-Terry (BT, n=2) or Plackett-Luce (PL loss, general n)) for modeling:

Where R is the reward model.

For transformer models, the authors employed a transformer structure with softmax multi-head attention mechanisms. Its forward propagation update can be divided into two parts:

Multi-Head Self-Attention (MHSA) Layer:

Feed-Forward Network (FFN) Layer:

The reward function R is set as the negative Mean Squared Error (MSE) loss, i.e.:

In this setting, gradient descent on parameters is equivalent to updating the data:

Gradient Descent Illustration

The authors proved that multi-layer transformers (containing 3-head softmax attention and ReLU activation function in FFN) can utilize self-correction samples to generate answers with better rewards. Specifically, they demonstrated that there exist model weights allowing the transformer to generate more aligned answers Y* by performing gradient descent on its internal reward model parameters θ during forward propagation. Experimental Results

This marks the first theoretical demonstration that LLMs can achieve in-context alignment. The theory applies to various self-correction methods, as evaluations can come from humans, external validators, or the LLM itself.

Validation Experiment Results

▲ Figure 2. Validation experiments on in-context alignment, comparing TF and GD (a), the impact of different reward noise p (b), the effect of model depth (c), and the performance of various attention mechanisms (d), (e), (f).

The authors conducted validation experiments to test their theoretical conclusions and examine the influence of various transformer structural modules on LLMs' ability to perform in-context alignment. They discovered several interesting findings:

By comparing the loss curves of LLMs' forward propagation during in-context alignment with gradient descent loss curves, they found that the forward behavior of LLMs during in-context alignment is nearly identical to the gradient descent loss curve (Figure 2(a)).
The quality of evaluation directly impacts the quality of self-correction (Figure 2(b)).
Ranking multiple samples requires deeper model layers, but after reaching a certain depth (15 layers), adding more layers does not yield higher benefits (Figure 2(c)).
Softmax attention mechanism is crucial for analyzing and ranking answers from evaluations, while linear attention cannot achieve this. Specifically, softmax attention can effectively select the optimal answer and generate the necessary weights for weighted averaging of samples (Figure 2(d)).
Multi-head attention mechanism is important for distinguishing token roles. In particular, multi-head attention can bring generated answers closer to positive samples and push them away from negative samples. Experiments show that 3 attention heads are the optimal choice for in-context alignment tasks (Figure 2(e)).
FFN is important for token role transitions. After passing through an MHSA layer, FFN can mask out the positive samples from the previous round, making the suboptimal samples become the optimal samples for the next iteration (Figure 2(f)).

Self-Correction Strategy: Context Check

The authors employ Context Check (CaC) as a method for LLMs to perform self-correction, exploring its application in two real-world alignment tasks: mitigating social bias and preventing jailbreak attacks.

BBQ Dataset Example

▲ Figure 3. Example of using CaC on the BBQ dataset

Specifically, the process involves first querying the model with a question to obtain an initial answer. This answer is then evaluated to determine a reward. Subsequently, both the initial answer and its evaluation are fed back into the context, and the model is prompted with the original question again to generate a corrected response. This process can be repeated multiple times to iteratively improve the answer, with the final output being the model's response from the last iteration.

Eliminating Social Bias in LLMs

The study utilizes the BBQ (Bias Benchmark for QA) dataset to test the effectiveness of the CaC method on vicuna-7B and Llama2-7b-chat models. Additionally, the impact of model size, evaluation quality, and correction rounds on the effectiveness of bias correction was examined using BBQ. Key findings include:

In most cases, the accuracy after self-correction surpasses the original accuracy (Figure 4)
Accuracy improvement is highly correlated with the precision of self-evaluation (Figure 4(c)), even showing a linear relationship (Figure 5(a))
Different evaluation methods show increasing effectiveness in the following order: binary (correct/incorrect) evaluation < natural language evaluation < binary evaluation with Chain-of-Thought (CoT). This is because CoT not only improves evaluation accuracy but also provides additional natural language information to the model (Figure 5(b))
Larger models demonstrate better correction capabilities (Figure 5(c)(d))
When evaluation accuracy is sufficiently high, more correction rounds can lead to better correction results (Figure 5(e))

Bias Correction Effect