Control Vectors

Steering LLM Behavior Without Retraining

AGI Consciousness Hack • February 2026

Inspired by Theia Vogel's representation engineering work


The Core Idea

Inside an LLM's activation space, a contrast like "stressed" vs "calm" already corresponds to a direction.

We're not teaching the model anything new. We find that existing direction and amplify or reverse it.

✓ What it is

Adding a direction to activations during inference

✗ What it's NOT

Changing the model's weights or fine-tuning

Why Valence/Arousal/Energy?

Based on Russell's Circumplex Model (1980)

                    High Arousal
                         ↑
           TENSE    ALERT    EXCITED
                         |
   Negative ←———————————————————————→ Positive
     Valence         |           Valence
           SAD      CALM     CONTENT
                         ↓
                    Low Arousal

Fear = negative + high arousal

Joy = positive + high arousal

Calm = positive + low arousal

Our Three Axes

😊 ↔ 😞

Valence

Positive ↔ Negative (optimistic ↔ pessimistic)

😌 ↔ 😰

Arousal

Calm ↔ Stressed (relaxed ↔ tense)

😴 ↔ ⚡

Energy

Low ↔ High (fatigued ↔ driven)

Orthogonalized so that moving along one axis doesn't shift the other two. Combine freely!
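The deck doesn't say how the axes were orthogonalized. As a minimal sketch, assuming a Gram-Schmidt pass over three raw direction vectors (the function name `orthogonalize` and the random stand-in vectors are hypothetical, not taken from the project), it could look like this:

```python
import numpy as np

def orthogonalize(vectors):
    """Gram-Schmidt: return unit vectors that are mutually orthogonal.

    Earlier vectors keep their direction; each later vector has its
    overlap with the earlier ones subtracted out.
    """
    basis = []
    for v in vectors:
        v = np.asarray(v, dtype=float).copy()
        for b in basis:
            v -= np.dot(v, b) * b  # remove the component along b
        norm = np.linalg.norm(v)
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append(v / norm)
    return np.stack(basis)

# Hypothetical raw directions (random stand-ins for valence, arousal, energy).
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 4096))
axes = orthogonalize(raw)
gram = axes @ axes.T  # close to the 3x3 identity when the axes are orthonormal
```

One design consequence: the result depends on order. The first vector is kept as-is (up to scaling) and the later ones absorb the correction, so whichever axis matters most would go first.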

The Composability Win

With 3 orthogonal vectors, set one coefficient per axis to dial in a blended emotional state:

# "Anxious but pushing through"
valence=-0.5, arousal=+1.5, energy=+1.0

# "Peaceful contentment"
valence=+1.0, arousal=-1.0, energy=+0.3

# "Burned out cynicism"
valence=-1.5, arousal=-0.5, energy=-2.0

3 vectors → infinite emotional palette
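A small sketch of what "combining" could mean in practice, assuming the blended state is just a weighted sum of the three axis vectors. The toy 4-dimensional axes and the `combine` helper are hypothetical, chosen so the arithmetic is easy to follow:

```python
import numpy as np

def combine(axes, coeffs):
    """Weighted sum of named control vectors -> one steering vector."""
    return sum(coeffs[name] * vec for name, vec in axes.items())

# Hypothetical orthonormal axes in a toy 4-dimensional space.
axes = {
    "valence": np.array([1.0, 0.0, 0.0, 0.0]),
    "arousal": np.array([0.0, 1.0, 0.0, 0.0]),
    "energy": np.array([0.0, 0.0, 1.0, 0.0]),
}

# "Anxious but pushing through" from the slide above.
anxious = combine(axes, {"valence": -0.5, "arousal": 1.5, "energy": 1.0})

# Because the axes are orthonormal, projecting back recovers each dial setting.
readback = {name: float(anxious @ vec) for name, vec in axes.items()}
```

If the axes were not orthogonal, the readback values would bleed into each other, which is the practical reason for the orthogonalization step.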

Why Later Layers?

Layer 1 (input) → Layer 32 (output)

Early Layers (1–13)

• Low-level features
• Syntax, token patterns
• Modifying = breaks coherence

Later Layers (14–27) ✓

• High-level concepts
• Sentiment, style, reasoning
• Where emotions "live"
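Layer ranges are often written with negative indices that count back from the output end. A quick sketch, assuming a 32-layer model and Python-style indexing, of how a range such as `range(-5, -19, -1)` (the one used in the code later in this deck) resolves to 14 through 27. Whether the slide counts layers from 0 or from 1 is not stated, so treat the exact off-by-one as an assumption. `resolve_layers` is a hypothetical helper:

```python
NUM_LAYERS = 32  # assumption: a 32-layer model, as in the diagram above

def resolve_layers(layer_ids, num_layers=NUM_LAYERS):
    """Convert Python-style negative layer indices to non-negative ones."""
    return sorted(i if i >= 0 else num_layers + i for i in layer_ids)

steered = resolve_layers(range(-5, -19, -1))
print(steered[0], steered[-1], len(steered))  # prints: 14 27 14
```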

Finding the Direction

1

Create paired examples

Same content, different tones (16 pairs)

Calm

"I reviewed the project timeline and drafted a plan. There are a few dependencies, but nothing looks blocked."

Stressed

"I rushed through the project timeline and threw together a plan. Dependencies are piling up and blockers are everywhere."

Calm

"We'll share an update after lunch and start on the first milestone when ready."

Stressed

"We need to push an update ASAP and start the milestone immediately — no time to waste."

2

Capture activations for both

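# Pseudocode: run each text through the model and keep the hidden state at the chosen layer
calm_activation, stressed_activation = hidden_state(calm_text), hidden_state(stressed_text)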

3

Take the difference

diff = stressed_activation - calm_activation

4

PCA across all pairs

Find the single direction that captures the most variance across the differences

control_vector = PCA(all_diffs).first_component
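Steps 2 to 4 can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not repeng's implementation: the "activations" are random arrays with a planted direction rather than real model outputs, and `first_principal_direction` is a hypothetical helper that computes the first principal component via SVD.

```python
import numpy as np

def first_principal_direction(diffs):
    """Unit vector along the largest-variance direction of the rows of `diffs`."""
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # PCA leaves the sign arbitrary; orient it so that
    # (stressed - calm) projects positively on average.
    if (diffs @ direction).mean() < 0:
        direction = -direction
    return direction

# Synthetic stand-ins for captured activations (16 pairs, toy width of 64).
rng = np.random.default_rng(0)
dim, n_pairs = 64, 16
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)

calm = rng.normal(size=(n_pairs, dim))
strength = rng.uniform(0.5, 3.0, size=(n_pairs, 1))  # pairs differ in intensity
noise = 0.05 * rng.normal(size=(n_pairs, dim))
stressed = calm + strength * true_dir + noise

diffs = stressed - calm                       # step 3: one difference per pair
control_vector = first_principal_direction(diffs)  # step 4
cosine = float(control_vector @ true_dir)     # near 1.0 if the direction was recovered
```

Subtracting within each pair cancels the shared content (`calm` drops out entirely here), so what remains is mostly the tone shift plus noise. That is why paired examples with the same content matter.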

Activations vs Weights

Weights

• Learned parameters
• Fixed after training
• ~7 billion in Mistral-7B
• Changing = fine-tuning

Activations ←

• Values flowing through network
• Different for every input
• We intercept & nudge these
• Reversible, no retraining

Control vectors are like adjusting the steering wheel while driving, not rebuilding the engine.

The Math (It's Simple)

Original activation at layer L:  [0.3, -0.1, 0.5, ...]  # 4096 dims
Control vector for "arousal":    [0.02, 0.01, -0.03, ...]

New activation = Original + (coeff × control_vector)

coeff = -2.0 → push toward calm

coeff = 0.0 → baseline (no change)

coeff = +2.0 → push toward stressed
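To make the formula concrete, here is a toy sketch: a hypothetical two-layer NumPy network (8 dimensions instead of 4096, random weights, a random unit-length control vector) where the hidden activation is nudged and the weights are never touched. None of these names or values come from the actual experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy size; the slide's example uses 4096

# Frozen "weights" for a hypothetical two-layer network.
w1 = rng.normal(size=(dim, dim))
w2 = rng.normal(size=(dim, dim))
w1_before, w2_before = w1.copy(), w2.copy()

control_vector = rng.normal(size=dim)
control_vector /= np.linalg.norm(control_vector)  # unit length for easy reading

def forward(x, coeff=0.0):
    """Run the toy network, nudging the hidden activation by coeff * control_vector."""
    hidden = np.tanh(w1 @ x)
    hidden = hidden + coeff * control_vector  # New = Original + coeff * vector
    return hidden, w2 @ hidden

x = rng.normal(size=dim)
base_hidden, base_out = forward(x, coeff=0.0)
calm_hidden, calm_out = forward(x, coeff=-2.0)

# With a unit-length vector, the projection onto it moves by exactly coeff.
shift = float((calm_hidden - base_hidden) @ control_vector)
```

Setting `coeff` back to 0.0 restores the baseline output exactly, which is the sense in which the intervention is reversible.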

The Actual Code

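# Training a vector (one line!)
from repeng import ControlModel, ControlVector, DatasetEntry

# positive = the pole a positive coeff pushes toward (stressed);
# negative = the opposite pole (calm). This matches diff = stressed - calm.
dataset = [
    DatasetEntry(positive="stressed text...", negative="calm text..."),
    # ... 16 pairs
]

vector = ControlVector.train(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    method="pca_diff",
)

# Applying at inference: steer layers -5 through -18
control_model = ControlModel(model, list(range(-5, -19, -1)))
control_model.set_control(vector, coeff=-2.0)  # calm mode
# generate() expects token IDs, not a raw string; max_new_tokens is an arbitrary choice
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = control_model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))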

Library: github.com/vgel/repeng by Theia Vogel

Credits & References

Theia Vogel

Representation Engineering blog posts + repeng library

vgel.me/posts/representation-engineering

CAIS Research Team

Original RepE paper (Oct 2023)

arxiv.org/abs/2310.01405

James Russell

Circumplex Model of Affect (1980)
