Steering LLM Behavior Without Retraining
AGI Consciousness Hack • February 2026
Inspired by Theia Vogel's representation engineering work
Inside an LLM, there's already a "direction" that represents concepts like "stressed" vs "calm".
We're not teaching the model anything new — we're finding that existing direction and amplifying or reversing it.
What this is: adding a direction to activations during inference
What this is not: changing the model's weights or fine-tuning
Based on Russell's Circumplex Model (1980)
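[Figure: circumplex diagram. Vertical axis: arousal, from low (bottom) to high (top). Horizontal axis: valence, from negative (left) to positive (right). High-arousal row, left to right: TENSE, ALERT, EXCITED. Low-arousal row, left to right: SAD, CALM, CONTENT.]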
Fear = negative + high arousal
Joy = positive + high arousal
Calm = positive + low arousal
Valence: Positive ↔ Negative (optimistic ↔ pessimistic)
Arousal: Calm ↔ Stressed (relaxed ↔ tense)
Energy: Low ↔ High (fatigued ↔ driven)
Orthogonalized so they're independent — combine freely!
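One common way to make a set of directions independent is Gram-Schmidt orthogonalization. The sketch below is an illustration only: the slides do not say which method was used, the function name `orthogonalize` is hypothetical, and random vectors stand in for the real valence, arousal, and energy directions.

```python
import numpy as np

def orthogonalize(vectors):
    """Gram-Schmidt: return unit vectors that are mutually orthogonal."""
    basis = []
    for v in vectors:
        w = np.asarray(v, dtype=float).copy()
        # Remove the part of w that lies along each direction already kept
        for b in basis:
            w -= np.dot(w, b) * b
        norm = np.linalg.norm(w)
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        basis.append(w / norm)
    return np.stack(basis)

# Random stand-ins for three trained control vectors (assumption: 4096 dims)
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 4096))
ortho = orthogonalize(raw)
valence, arousal, energy = ortho

# Pairwise dot products: 1 on the diagonal, 0 elsewhere
gram = ortho @ ortho.T
```

Note that the result depends on order: the first vector keeps its direction, and each later vector loses whatever it shared with the earlier ones.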
With 3 orthogonal vectors, dial in any emotional state:
# "Anxious but pushing through"
valence=-0.5, arousal=+1.5, energy=+1.0
# "Peaceful contentment"
valence=+1.0, arousal=-1.0, energy=+0.3
# "Burned out cynicism"
valence=-1.5, arousal=-0.5, energy=-2.0
3 vectors → infinite emotional palette
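Dialing in a state amounts to a weighted sum of the three directions. Here is a minimal sketch, assuming the vectors are orthonormal; the helper `blend` and the tiny 4-dimensional vectors are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def blend(vectors, coeffs):
    """Weighted sum of named control vectors."""
    return sum(coeffs[name] * vectors[name] for name in coeffs)

# Hypothetical orthonormal directions in a toy 4-dimensional space
vectors = {
    "valence": np.array([1.0, 0.0, 0.0, 0.0]),
    "arousal": np.array([0.0, 1.0, 0.0, 0.0]),
    "energy": np.array([0.0, 0.0, 1.0, 0.0]),
}

# "Anxious but pushing through"
anxious = blend(vectors, {"valence": -0.5, "arousal": 1.5, "energy": 1.0})

# Because the directions are orthonormal, projecting the blend back onto
# one axis recovers exactly the coefficient that was dialed in for it.
arousal_component = float(np.dot(anxious, vectors["arousal"]))
```

This is why orthogonality matters: changing one coefficient leaves the other two components untouched.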
[Figure: the model's layer stack, from Layer 1 (input) to Layer 32 (output).]
Same content, different tones (16 pairs). In each pair below, the calm version comes first and the stressed version second.
"I reviewed the project timeline and drafted a plan. There are a few dependencies, but nothing looks blocked."
"I rushed through the project timeline and threw together a plan. Dependencies are piling up and blockers are everywhere."
"We'll share an update after lunch and start on the first milestone when ready."
"We need to push an update ASAP and start the milestone immediately — no time to waste."
calm_activation, stressed_activation = model.forward(both_texts)
diff = stressed_activation - calm_activation
Find the primary direction that captures variance
control_vector = PCA(all_diffs).first_component
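The PCA step can be sketched with plain NumPy. This is an illustration under stated assumptions, not the library's implementation: the function name `first_principal_direction` is hypothetical, the sign-orientation step is an added convenience, and synthetic data stands in for real activation differences.

```python
import numpy as np

def first_principal_direction(diffs):
    """Return the unit direction of greatest variance among the rows of diffs."""
    diffs = np.asarray(diffs, dtype=float)
    centered = diffs - diffs.mean(axis=0)  # PCA centers the data first
    # Rows of vt are the principal directions, strongest first
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # PCA leaves the sign arbitrary. Assumption: orient the vector so the
    # average (stressed - calm) difference projects positively onto it.
    if np.dot(diffs.mean(axis=0), direction) < 0:
        direction = -direction
    return direction

# Synthetic stand-in for 16 (stressed - calm) activation differences
rng = np.random.default_rng(0)
true_direction = np.zeros(64)
true_direction[0] = 1.0
strengths = rng.uniform(1.0, 3.0, size=(16, 1))
all_diffs = strengths * true_direction + rng.normal(scale=0.02, size=(16, 64))

control_vector = first_principal_direction(all_diffs)
```

With the sign fixed this way, a positive coefficient pushes toward "stressed" and a negative one toward "calm".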
Control vectors are like adjusting the steering wheel while driving, not rebuilding the engine.
Original activation at layer L: [0.3, -0.1, 0.5, ...] # 4096 dims
Control vector for "arousal": [0.02, 0.01, -0.03, ...]
New activation = Original + (coeff × control_vector)
-2.0: Push toward calm
0.0: Baseline (no change)
+2.0: Push toward stressed
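The update rule can be checked numerically. Here is a minimal sketch, assuming a unit-length control vector; the function name `steer` and the 4-dimensional values are hypothetical stand-ins for a real 4096-dimensional activation.

```python
import numpy as np

def steer(activation, control_vector, coeff):
    """New activation = original + coeff * control_vector."""
    return activation + coeff * control_vector

# Toy values standing in for one hidden state and the "arousal" vector
hidden = np.array([0.3, -0.1, 0.5, 0.2])
arousal = np.array([0.5, 0.5, -0.5, 0.5])  # unit length

baseline = float(np.dot(hidden, arousal))
for coeff in (-2.0, 0.0, 2.0):
    steered = steer(hidden, arousal, coeff)
    # How far the activation moved along the arousal direction
    shift = float(np.dot(steered, arousal)) - baseline
    print(f"coeff={coeff:+.1f}  shift along arousal={shift:+.2f}")
```

For a unit-length vector, the activation's position along the steering direction moves by exactly `coeff`, while everything orthogonal to that direction is left alone.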
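# Training a vector (one call!)
# Assumes `model` and `tokenizer` are an already-loaded Hugging Face causal LM and its tokenizer
from repeng import ControlModel, ControlVector, DatasetEntry
dataset = [
    # positive = stressed, negative = calm, so that a negative coeff pushes toward calm
    DatasetEntry(positive="stressed text...", negative="calm text..."),
    # ... 16 pairs
]
vector = ControlVector.train(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    method="pca_diff",
)
# Applying at inference
# Layer indices are passed positionally; negative values count back from the last layer
control_model = ControlModel(model, list(range(-5, -19, -1)))
control_model.set_control(vector, coeff=-2.0)  # calm mode
# generate() expects token IDs rather than a raw string
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = control_model.generate(**inputs, max_new_tokens=64)  # 64 is an arbitrary cap
output = tokenizer.decode(output_ids[0], skip_special_tokens=True)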
Library: github.com/vgel/repeng by Theia Vogel
Representation Engineering blog posts + repeng library
vgel.me/posts/representation-engineering
Circumplex Model of Affect (1980)