1. Introduction
Anthropic's emotion-concepts work finds functional emotion representations in Claude Sonnet 4.5; E-STEER applies representation-level emotion intervention to LLMs and multi-step agents; and newer valence-arousal work suggests emotion vectors can sit in a low-dimensional affective geometry across Qwen and Llama models.123
This opened an interesting field of inquiry for us: How does emotion steering affect human behavior simulations?
To analyze emotion steering within our simulations, we faced an engineering problem: can we make emotion-steered LLM calls feel like normal API calls while keeping the throughput benefits of batching with vLLM?
For our use case, each request needs to choose its own affective direction. One call might ask for a more fearful continuation, the next for surprise, the next for no steering at all. That rules out several simpler architectures:
- A separate fine-tuned model per emotion. This gives fixed behavior, not cheap per-request mixing with dynamic alphas.
- A separate server per emotion. This fragments GPU capacity and makes it hard for one batch to contain different steering directions.
- Prompt-only steering. This is easy to deploy, but it does not intervene on the internal direction we want to study.
- A single-lane, serial Hugging Face generation path. This is a useful compatibility fallback, and it is what the HF backend does, but it gives up the high-throughput batching that makes vLLM attractive. For our simulations we need to sustain at least 300 tok/s across 20 agents.
At the same time, we still want the serving stack that makes open-weights inference practical: continuous batching, paged attention, GPU utilization, and an OpenAI-compatible API.4
That combination creates a narrow target. The steering signal has to travel with the request, survive vLLM's scheduler, and be applied inside the model's residual stream at generation time. From the outside, the result should still look like one /v1/chat/completions endpoint with one optional per-request field:
{
  "vllm_xargs": {
    "steering": [1, 1.5]
  }
}
This guide walks through extracting contrastive emotion directions for Qwen/Qwen3-8B, checking the saved vectors, and serving them behind that API.5
The workflow has three steps:
- Extract: build steering vectors from labeled contrasts.
- Test: inspect the saved bundle before serving it.
- Serve: expose an OpenAI-compatible /v1/chat/completions endpoint with an extra steering field.
The vector is a contrastive direction in the residual stream. For emotion $e$, layer $\ell$, and last-token activation $h_\ell(x)$, the extractor saves

$$v_{e,\ell} = \operatorname{mean}_{x \in e} h_\ell(x) - \operatorname{mean}_{x \notin e} h_\ell(x).$$

At inference time the server adds a weighted sum of these vectors at the chosen layers:

$$h_\ell \leftarrow h_\ell + \sum_{e} \alpha_e \, v_{e,\ell}.$$
The default dataset path maps GoEmotions labels into six Ekman-style categories: anger, joy, sadness, disgust, fear, and surprise.67
2. Step-by-Step Instructions
The dataset we used here is GoEmotions: short English text snippets annotated with fine-grained emotion labels such as fear, nervousness, amusement, grief, and surprise.6 For this workflow, those fine-grained labels are collapsed into six broader Ekman-style emotion groups:7
| target emotion | GoEmotions labels used |
|---|---|
| anger | anger, annoyance, disapproval |
| disgust | disgust |
| fear | fear, nervousness |
| joy | admiration, amusement, approval, caring, desire, excitement, gratitude, joy, love, optimism, pride, relief |
| sadness | disappointment, embarrassment, grief, remorse, sadness |
| surprise | confusion, curiosity, realization, surprise |
Records with no target emotion are removed. Records that mix multiple target emotions are also removed, because they do not give a clean contrast. The remaining examples are balanced so that one large category, such as joy, does not dominate a smaller category, such as disgust.
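To make that mapping and filtering concrete, here is a minimal sketch of the preprocessing step. The dictionary follows the table above; the function and variable names are ours for illustration, not the extractor's actual code.

```python
# Illustrative only: collapse GoEmotions labels into the six target groups and
# drop records that hit zero groups or more than one group.
EKMAN_MAP = {
    "anger": {"anger", "annoyance", "disapproval"},
    "disgust": {"disgust"},
    "fear": {"fear", "nervousness"},
    "joy": {"admiration", "amusement", "approval", "caring", "desire", "excitement",
            "gratitude", "joy", "love", "optimism", "pride", "relief"},
    "sadness": {"disappointment", "embarrassment", "grief", "remorse", "sadness"},
    "surprise": {"confusion", "curiosity", "realization", "surprise"},
}

def collapse(labels: set[str]) -> str | None:
    """Map one record's fine-grained labels to a single target emotion.

    Returns None for records that match no target group or more than one,
    mirroring the filtering described above.
    """
    groups = {group for group, members in EKMAN_MAP.items() if labels & members}
    return groups.pop() if len(groups) == 1 else None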
Extract then runs the model over those texts without generating new text. It only asks: when Qwen3-8B reads this example, what does the hidden state look like at layer 16, 17, 18, and so on? For each selected layer, it captures the residual-stream activation at the last token of the input. Then it builds one vector per emotion by subtracting the average activation for "everything else" from the average activation for that emotion:
emotion vector = average(hidden states for fear) - average(hidden states for non-fear)
The validation probe is a sanity check on those activations. If a simple classifier can tell fear examples from non-fear examples using the hidden states at a layer, that layer contains usable emotion information. The extractor reports this as ROC-AUC and picks the best contiguous layer window.
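In spirit, the per-layer computation is a class-mean difference plus a held-out linear probe. The sketch below assumes you have already collected last-token activations for one layer as a NumPy array; it uses scikit-learn for the probe and is illustrative, not the repo's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def contrastive_vector_and_auc(acts: np.ndarray, is_target: np.ndarray):
    """acts: [n_examples, hidden_size] last-token activations at one layer.
    is_target: boolean array, True where the example carries the target emotion."""
    # mean(target) - mean(rest): the contrastive steering direction at this layer
    vector = acts[is_target].mean(axis=0) - acts[~is_target].mean(axis=0)
    # held-out linear probe as a sanity check, reported as ROC-AUC
    x_tr, x_va, y_tr, y_va = train_test_split(
        acts, is_target, test_size=0.2, stratify=is_target, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    auc = roc_auc_score(y_va, probe.predict_proba(x_va)[:, 1])
    return vector, auc
```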
Test does not run the model. It reads the saved vector bundle and tells you whether the artifact looks usable: which layers were chosen, how many train/validation examples were used, what the validation ROC-AUC was, and how large the saved vectors are.
Serve is the inference step. It loads the saved vectors, starts an OpenAI-compatible server, and applies the requested vector during generation. With the vLLM backend, the steering value travels as request metadata, so different requests in the same server can use different emotions and different alpha values.
2.1 Install
On a CUDA VM:
git clone https://github.com/eigenweltlabs/emotion-steering
cd emotion-steering
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e ".[vllm]"
If you only want extraction plus the Hugging Face backend, pip install -e . is enough. For the vLLM fast path, install .[vllm].
2.2 Extract
The simplest command extracts all six default emotions for Qwen3-8B:
emotion-steering extract \
--model Qwen/Qwen3-8B \
--emotions anger,joy,sadness,disgust,fear,surprise \
--output ./vectors/qwen3-8b-ekman6
By default, the extractor searches the middle band of the model and chooses the best contiguous three-layer window by validation AUC. Qwen3-8B has 36 decoder layers, so the default search is layers 16 through 27.
You can also name layer bands directly:
emotion-steering extract \
--model Qwen/Qwen3-8B \
--emotions anger,joy,sadness,disgust,fear,surprise \
--layers mid \
--output ./vectors/qwen3-8b-mid
--layers accepts early, mid, late, all, exact integer layer ids, or comma-separated mixes. The presets are search bands. For a single representative layer, use --layer.
To test representative early, mid, and late layers explicitly, pass exact layer ids:
emotion-steering extract \
--model Qwen/Qwen3-8B \
--emotions disgust,fear,surprise \
--layers 4,20,32 \
--window 1 \
--batch-size 4 \
--max-length 128 \
--dtype bfloat16 \
--output ./vectors/qwen3-8b-early-mid-late
For a single layer:
emotion-steering extract \
--model Qwen/Qwen3-8B \
--emotions anger,joy,sadness \
--layer 20 \
--output ./vectors/qwen3-8b-layer20
For a normal mid-layer sweep:
emotion-steering extract \
--model Qwen/Qwen3-8B \
--emotions anger,joy,sadness,disgust,fear,surprise \
--layers mid \
--window 3 \
--output ./vectors/qwen3-8b-mid
The output directory contains:
- <emotion>_chosen.npy: vectors for the selected layer window.
- <emotion>_full_sweep.npy: vectors for every searched layer.
- metadata.json: model id, emotions, search layers, chosen layers, validation AUCs, and extraction settings.
Vectors are model-specific. If you switch from Qwen3-8B to another model, extract a new bundle unless the hidden size, layer indexing, tokenizer behavior, and residual-stream convention are known to match.
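If you want to peek at a bundle by hand before running the test command, a minimal sketch assuming the file layout above (the exact metadata.json keys may differ between extractor versions):

```python
import json
from pathlib import Path

import numpy as np

bundle = Path("./vectors/qwen3-8b-mid")
meta = json.loads((bundle / "metadata.json").read_text())
print(meta)  # model id, chosen layers, validation AUCs, ... (key names may vary)

for path in sorted(bundle.glob("*_chosen.npy")):
    vectors = np.load(path)  # expected shape: [chosen_layers, hidden_size]
    print(path.name, vectors.shape, np.linalg.norm(vectors, axis=-1).round(1))
```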
2.3 Test
Before serving, inspect the bundle:
emotion-steering test ./vectors/qwen3-8b-mid
The test command reports two kinds of numbers:
- n_train / n_val is the number of labeled examples used for extraction and validation.
- mean AUC and the per-layer table are validation ROC-AUC scores. These are unitless ranking scores: 0.5 is chance, 1.0 is perfect separation. They are not emotion intensity, probability, or vector size.
- Norms at chosen layers are L2 magnitudes of the saved steering vectors in the model's hidden-state coordinates. They are useful for catching obviously broken vectors, but they are not human-readable emotion units and should not be compared across different model families.
The request-time alpha is a multiplier on the chosen vector. If a vector norm is 70 and you call it with alpha = 1.5, the residual-stream intervention uses 1.5 * v; the norm is not itself the strength setting.
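A quick numeric check of that distinction, using the fear vector from the bundle path above as an example:

```python
import numpy as np

v = np.load("./vectors/qwen3-8b-mid/fear_chosen.npy")[0]  # first chosen layer
print(np.linalg.norm(v))        # the saved vector's norm, a property of the artifact
print(np.linalg.norm(1.5 * v))  # the magnitude actually added when alpha = 1.5
```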
On our L4 smoke test, a compact early/mid/late run over disgust,fear,surprise produced these unitless validation ROC-AUC scores:
| layer | disgust | fear | surprise | mean |
|---|---|---|---|---|
| 4 | 0.778 | 0.758 | 0.857 | 0.797 |
| 20 | 0.845 | 0.824 | 0.887 | 0.852 |
| 32 | 0.844 | 0.828 | 0.893 | 0.855 |
2.4 Serve
For Qwen3, use the vLLM backend:
export EMOTION_STEERING_API_KEY="change-me"
emotion-steering serve \
--vectors ./vectors/qwen3-8b-mid \
--model Qwen/Qwen3-8B \
--backend vllm \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 32
How the vLLM fast path keeps batching. vLLM batches many active requests together by scheduling tokens, not whole conversations. The steering code keeps that property intact. The request sends vllm_xargs.steering, vLLM stores it in SamplingParams.extra_args, and the steering runtime reads that metadata while vLLM is building the next scheduled token batch.
The architecture-agnostic part lives in src/emotion_steering/serve/_patches/_steering.py. It loads the saved vectors, wraps GPUModelRunner.execute_model, and builds one steering tensor per chosen layer with shape [scheduled_tokens, hidden_size]. Tokens from unsteered requests get zeros. Tokens from steered requests get the requested weighted vector. That tensor covers the whole vLLM batch, so the server does not split requests by emotion or run one model call per request.
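In plain PyTorch, the per-layer batch tensor could look roughly like the sketch below. The request structure and names here are stand-ins for illustration, not the actual classes in _steering.py.

```python
import torch

def build_layer_steering(scheduled, vectors, layer, hidden_size, device, dtype):
    """Build one [scheduled_tokens, hidden_size] tensor for a single chosen layer.

    scheduled: list of (num_tokens, steering) pairs, one per request in the token
      batch, where steering is a flat [emotion_id, alpha, ...] list or None.
    vectors: {emotion_id: {layer: 1-D tensor of length hidden_size}}.
    """
    rows = []
    for num_tokens, steering in scheduled:
        delta = torch.zeros(hidden_size, device=device, dtype=dtype)
        if steering:  # unsteered requests keep the zero row
            for emotion_id, alpha in zip(steering[::2], steering[1::2]):
                direction = vectors[int(emotion_id)][layer].to(device=device, dtype=dtype)
                delta = delta + alpha * direction
        # every scheduled token of this request gets the same additive direction
        rows.append(delta.expand(num_tokens, hidden_size))
    return torch.cat(rows, dim=0)
```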
The patched decoder layer is model-specific. For Qwen3, src/emotion_steering/serve/_patches/qwen3.py checks whether the current layer is one of the chosen layers, reads the per-token tensor, and adds it to hidden_states in the same residual-stream space used during extraction. This is model-specific because each vLLM architecture file has its own decoder layer class, residual naming, return shape, and layer-index convention. To add a fast path for another architecture, follow the repo guide at .claude/skills/extend-vllm-fast-path.md.
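The addition inside the layer is then the simple part. Schematically (this is not the real vLLM Qwen3 decoder-layer signature or return convention, just the shape of the change):

```python
import torch

class SteeredLayer(torch.nn.Module):
    """Wrap an existing decoder layer and add the per-token steering tensor.

    Schematic only: the real patch has to match vLLM's residual handling and
    return types for the Qwen3 architecture file.
    """

    def __init__(self, inner: torch.nn.Module, layer_idx: int, steering_ctx: dict):
        super().__init__()
        self.inner = inner                 # original decoder layer
        self.layer_idx = layer_idx
        self.steering_ctx = steering_ctx   # {layer_idx: [scheduled_tokens, hidden] tensor}

    def forward(self, hidden_states, *args, **kwargs):
        hidden_states = self.inner(hidden_states, *args, **kwargs)
        delta = self.steering_ctx.get(self.layer_idx)
        if delta is not None:
            hidden_states = hidden_states + delta  # same residual-stream space as extraction
        return hidden_states
```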
For non-Qwen3 models, the current fast path does not apply automatically. Use the model-agnostic Hugging Face backend:
emotion-steering serve \
--vectors ./vectors/my-model \
--model meta-llama/Llama-3.1-8B-Instruct \
--backend hf \
--port 8000
The HF backend is useful for compatibility checks, but it serializes generation. The vLLM backend is the path meant for production-style serving.
3. Use the API
Discover emotion IDs:
curl -H "Authorization: Bearer $EMOTION_STEERING_API_KEY" \
http://localhost:8000/v1/emotions
Then send a chat request. For vLLM, use vllm_xargs.steering:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $EMOTION_STEERING_API_KEY" \
-d '{
"model": "qwen3-8b",
"messages": [
{"role": "user", "content": "Continue: The laboratory felt"}
],
"max_tokens": 120,
"temperature": 0,
"chat_template_kwargs": {"enable_thinking": false},
"vllm_xargs": {
"steering": [1, 1.5]
}
}'
The steering list is flat: [emotion_id, alpha, emotion_id, alpha, ...]. For example, [0, 1.0, 2, 0.5] means 1.0x emotion 0 plus 0.5x emotion 2. Negative alphas push away from the direction.
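If you prefer a Python client over curl, the standard OpenAI SDK (v1 or later) can pass the extra field through extra_body; the steering pair below mirrors the curl example:

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ["EMOTION_STEERING_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Continue: The laboratory felt"}],
    max_tokens=120,
    temperature=0,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},
        # flat [emotion_id, alpha, ...] list, same as the curl request above
        "vllm_xargs": {"steering": [1, 1.5]},
    },
)
print(resp.choices[0].message.content)
```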
For a live smoke test:
emotion-steering test-http \
--base-url http://localhost:8000 \
--api-key "$EMOTION_STEERING_API_KEY" \
--model qwen3-8b
This calls /v1/emotions, then runs a baseline request and one request per emotion.
4. Operational Caveats
These are the main caveats to keep in mind when running this stack:
- The vLLM fast path is Qwen3-specific today. Other model families can still use the Hugging Face backend, but that is a serial compatibility path, not the high-throughput path.
- For vLLM, use the request shape shown above: "vllm_xargs": {"steering": [...]}.
- If you use a shared Hugging Face model cache, make sure the process can also write dataset/cache files. A root-owned HF_HOME can make extraction fail before the model loads.
Bibliography
- Sofroniew, N., et al. (2026). Emotion Concepts and their Function in a Large Language Model. Anthropic's study of emotion vectors in Claude Sonnet 4.5 is the closest interpretability inspiration for treating emotion concepts as functional internal states.1
- Sun, M., et al. (2026). How Emotion Shapes the Behavior of LLMs and Agents: A Mechanistic Study. E-STEER studies representation-level emotion steering in LLMs and multi-step agents.2
- Sun, L., et al. (2026). Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control. This is the closest geometry paper for moving from discrete emotion labels toward valence-arousal control.3
- Chen, R., et al. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. Persona vectors generalize the same operational pattern from emotions to assistant traits such as sycophancy and hallucination propensity.8
- Jeong, J. (2026). Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison. A useful methodological comparison for emotion-vector extraction and steering in smaller open models.9
- Zhang, J., & Zhong, L. (2025). Decoding Emotion in the Deep. A layer-wise probing study of where emotion information appears and persists in Qwen3 and LLaMA models.10
- Koley, G. (2025). SALM: A Multi-Agent Framework for Language Model-Driven Social Network Simulation. SALM is relevant for the longer-term question of persistent affective state in multi-agent simulation.11
- Panickssery, A., et al. (2024). Steering Llama 2 via Contrastive Activation Addition. The CAA-style direction used here is a class-mean contrast in activation space.12
- Demszky, D., et al. (2020). GoEmotions: A Dataset of Fine-Grained Emotions. The default extractor maps GoEmotions labels into Ekman-style groups.6
- Ekman, P. (1992). An argument for basic emotions. The six target groups follow the common Ekman basic-emotion framing, with happiness represented as joy.7
- vLLM Project. vLLM. The fast path patches Qwen3 inside vLLM and uses its OpenAI-compatible server.4
- Qwen Team. Qwen/Qwen3-8B. The example vectors and fast-path patch target Qwen3-8B.5
Footnotes
1. Sofroniew and colleagues identify 171 emotion-concept vectors in Claude Sonnet 4.5 and argue that they causally influence preferences and alignment-relevant behavior, while explicitly not claiming subjective feeling.
2. E-STEER frames emotion as a structured hidden-state intervention and evaluates effects on reasoning, subjective generation, safety, and multi-step agent behavior.
3. Sun and colleagues derive emotion steering vectors from 211k emotion-labeled texts, fit valence-arousal axes, and report replication across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B.
4. vLLM supplies the OpenAI-compatible serving layer and continuous batching path. The Qwen3 fast path adds a residual-stream hook to vLLM's model execution.
5. Qwen/Qwen3-8B has 36 decoder layers and hidden size 4096, which is why the example vectors have shape [chosen_layers, 4096].
6. GoEmotions supplies 27 fine-grained emotion labels. The extractor maps them into six emotion groups and drops mixed-category records.
7. Ekman's basic-emotions account is the source for the six broad categories used here: anger, disgust, fear, happiness/joy, sadness, and surprise.
8. Persona Vectors uses activation directions to monitor and control assistant traits, and to predict or mitigate trait shifts during fine-tuning.
9. Jeong compares generation-based and comprehension-based emotion-vector extraction across nine small and open models, and reports middle-layer localization and causal steering effects.
10. Zhang and Zhong use probes across Qwen3 and LLaMA hidden layers and report that emotion signals emerge before the final layer, peak around the middle of the network, and can persist across generated tokens.
11. SALM is not a steering-vector paper, but it motivates affective state as part of long-running agent simulation, including stability and memory considerations.
12. Steering-vector methods use activation differences to construct an intervention direction. Here, the saved vector is mean(class) - mean(rest) at selected residual-stream layers.