How We Measure the Predictiveness of Agents: Eigenwelt Labs

A human-behaviour simulation can be judged in several ways: by how well it predicts what real people do, by how realistically human it feels, or by how consistent a character it remains over time. You can also analyse how much real variety it covers and whether its internals look anything like ours.

To evaluate a simulation's predictive quality, we can choose among several metrics, depending on the subject and behaviour we want to predict. In recent papers, an agent is called predictive for several different reasons: it matches one person's answer, estimates a score for that person, reproduces a group's answer distribution, or forecasts how an intervention changes behaviour.

Those are all separate empirical claims. Each needs a different comparison against human data, and each uses different metrics. Before we can understand a result, we need to know what subject and behaviour are being predicted, what they are compared against, and which metrics were used to produce the evidence.

This article explains how predictive quality is measured and what each metric actually means. It is meant to create some shared language around the methods used to evaluate behavioural simulations.

1. What researchers try to predict

Most papers in this area predict one of four things.

One person's categorical answer. The agent is asked the same multiple-choice or forced-choice question as a real person. The measurement asks whether the agent chose the same option. Park and colleagues use this setup for 1,052 agents grounded in two-hour interviews, structured surveys, or both.¹

One person's numeric answer. The agent predicts a number: a Big Five score, a confidence rating, a willingness-to-pay value, a trust-game transfer, or another continuous response. The question is not just "same option or not", but how far the prediction is from the human value.

A population distribution. The agent system is used as a synthetic sample. The measurement asks whether the simulated population produces the same answer shares as a real population: 63% choose A, 24% choose B, 13% choose C. SimBench and OpinionQA are examples of this style of evaluation.²³

The effect of an intervention. The model predicts how behaviour changes when something is altered: for example, how much switching a form's default from opt-in to opt-out shifts sign-up rates, or how much a reminder email changes payment behaviour. The question is the size and direction of the change, not the absolute level. Lippert and colleagues test this by asking LLMs to forecast the results of published behavioural-science studies, then comparing those forecasts to the real human outcomes and to forecasts from human experts.⁴

These four are not interchangeable: a model can match a population distribution while failing to predict the individuals inside it,⁵ and a model can forecast how much an intervention shifts behaviour in published studies without being a good simulator of any particular person.
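The first of these gaps is easy to see in a toy case. The sketch below uses four hypothetical respondents (the data and function names are illustrative and not taken from any cited paper): the simulated answers reproduce the real answer shares exactly, yet no individual is predicted correctly.

```python
from collections import Counter


def answer_shares(answers):
    """Return the share of respondents choosing each option."""
    counts = Counter(answers)
    return {option: n / len(answers) for option, n in counts.items()}


def individual_accuracy(predicted, actual):
    """Share of respondents whose own answer was predicted correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: position i holds respondent i's real answer
# and the agent's predicted answer for that same respondent.
human = ["A", "A", "B", "B"]
agent = ["B", "B", "A", "A"]

same_shares = answer_shares(human) == answer_shares(agent)  # both are 50% A, 50% B
per_person = individual_accuracy(agent, human)              # every individual is missed

print(same_shares, per_person)
```

A distribution-level metric would score this agent as perfect, while an individual-level metric would score it at zero.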

2. The main metrics

A result reported as "75% accuracy" means something different from a result reported as "low distributional distance" or "high effect-size correlation". This section goes through the main metrics and explains what each one measures.

2.1 Accuracy: same answer or not

Accuracy is used when the prediction is a categorical answer from one person.

\mathrm{accuracy} = \frac{\#\mathrm{\ correct}}{\#\mathrm{\ items}}

If the agent matches the person on 64 out of 100 forced-choice items, accuracy is 0.64.

The hard part is interpretation. Humans are not perfectly self-consistent. If the same person answers the same survey twice, they may only match themself 80% of the time. That test-retest number is the ceiling. Normalised accuracy divides model accuracy by that human ceiling:

\mathrm{normalised\ accuracy} = \frac{\mathrm{model\ accuracy}}{\mathrm{human\ test\text{-}retest\ accuracy}}

In this example, 0.64 / 0.80 = 0.80. The agent predicts the person 80% as well as the person predicts themself. That is much more informative than raw accuracy alone. Park, for example, reports normalised General Social Survey accuracies of 0.83, 0.82, and 0.86 depending on the grounding source.¹
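As a minimal sketch of the two formulas above, the snippet below recomputes the worked example. The answer lists are made-up toy data and the function names are our own, not from any cited paper.

```python
def accuracy(agent_answers, human_answers):
    """Share of items on which the agent chose the same option as the human."""
    if not human_answers or len(agent_answers) != len(human_answers):
        raise ValueError("need two non-empty answer lists of equal length")
    matches = sum(a == h for a, h in zip(agent_answers, human_answers))
    return matches / len(human_answers)


def normalised_accuracy(model_accuracy, retest_accuracy):
    """Model accuracy expressed as a fraction of the human test-retest ceiling."""
    if retest_accuracy <= 0:
        raise ValueError("test-retest accuracy must be positive")
    return model_accuracy / retest_accuracy


# Toy data: 100 forced-choice items, the agent matches the person on 64.
human = ["A"] * 100
agent = ["A"] * 64 + ["B"] * 36

raw = accuracy(agent, human)               # 64 / 100
relative = normalised_accuracy(raw, 0.80)  # raw accuracy against an 80% ceiling

print(round(raw, 2), round(relative, 2))
```

Reporting both numbers side by side keeps the raw score and the ceiling visible, so a reader can see how much of the gap is human inconsistency.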

2.2 Error and correlation: how far off is the number?

For numeric answers, researchers usually report error, correlation, or both.

\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i| \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2}

Mean absolute error is the average miss. RMSE punishes large misses more strongly. These are useful when the size of the miss matters.

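Correlation answers a different question: do the model's predictions move with the human data, so that people or items with higher human values also get higher predictions? Pearson's r measures the strength of that linear relationship: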

r = \frac{\operatorname{cov}(\hat{y}, y)}{\sigma_{\hat{y}}\sigma_y}

This distinction matters. A model can rank customers correctly while systematically overestimating every score. That may be useful for prioritisation, but it is not the same as accurate absolute prediction.

Pataranutaporn and colleagues test LLMs that predict individual well-being on a 0–10 scale across 64,000 people in 64 countries.⁶ To see why MAE and correlation can disagree, imagine a small slice: five people whose real well-being scores are 3, 5, 7, 8, 9. A model predicts 5, 7, 9, 10, 10: biased upward by about two points across the board. MAE is 1.8, but the correlation is 0.99: the model has nearly perfectly identified who is happiest, just with every score shifted up. Useful if you want to find the most-well-off in a country; useless if you want to know how happy any single person actually is.
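The five-person slice can be checked directly. This is a small sketch using only the Python standard library; the helper names are our own, and the five scores are the illustrative values from the paragraph above, not data from the study.

```python
import math


def mae(pred, true):
    """Mean absolute error: the average size of the miss."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root mean squared error: large misses count more."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def pearson_r(pred, true):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(true)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    return cov / (spread_p * spread_t)


real = [3, 5, 7, 8, 9]
predicted = [5, 7, 9, 10, 10]

print(round(mae(predicted, real), 2))        # 1.8
print(round(rmse(predicted, real), 2))       # 1.84
print(round(pearson_r(predicted, real), 2))  # 0.99
```

The error metrics register the upward bias, while the correlation barely notices it, which is why the two should be reported together.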

As with accuracy, individual-level correlations should be read against test-retest reliability:

\mathrm{normalised\ correlation} = \frac{r_{\mathrm{model,human}}}{r_{\mathrm{test,retest}}}

Park reports a normalised Big Five correlation of 0.80 for interview-grounded agents.¹

2.3 Probability scores: did the model assign probability to the human answer?

Language models output a probability for every possible next token. Negative log-likelihood (NLL) scores how much probability the model assigned to the answer the human actually chose. For a multi-token answer it sums over the answer's tokens:

\mathrm{NLL}_i = -\log p(y_i^* \mid x_i) = -\sum_t \log p\bigl(y_{i,t} \mid y_{i,<t},\, x_i\bigr)

Here $x_i$ is the context (prompt, stimulus, prior tokens), $y_i^*$ is the human's answer, and $y_{i,t}$ is its $t$-th token. Lower NLL means more probability on the right answer. With natural logarithms, 70% probability on the human's choice gives NLL ≈ 0.36, 10% gives ≈ 2.30, and near-zero probability sends NLL toward infinity.

Suppose a human chose option B from {A, B, C}. Model 1 outputs A 50%, B 30%, C 20%: its top pick is A, but it still gave the human's answer 30% probability (NLL ≈ 1.20). Model 2 outputs A 99%, B 0.5%, C 0.5%: also picks A on top, but with NLL ≈ 5.30. NLL distinguishes "wrong but uncertain" from "confidently wrong".
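The two-model comparison can be reproduced in a few lines. This sketch assumes each model's output is available as a dictionary of option probabilities (a simplification: real evaluations work from token log-probabilities) and uses natural logarithms.

```python
import math


def nll(option_probs, human_choice):
    """Negative log-likelihood of the option the human actually chose."""
    p = option_probs.get(human_choice, 0.0)
    if p <= 0.0:
        return math.inf  # zero probability on the human's answer
    return -math.log(p)


model_1 = {"A": 0.50, "B": 0.30, "C": 0.20}    # wrong but uncertain
model_2 = {"A": 0.99, "B": 0.005, "C": 0.005}  # confidently wrong

print(round(nll(model_1, "B"), 2))  # 1.2
print(round(nll(model_2, "B"), 2))  # 5.3
```

Both models would score zero on accuracy for this item, so the difference between them is visible only in a probability-based score.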

Centaur uses this style of evaluation: cross-entropy loss applied only to the human-response tokens during fine-tuning, then held-out NLL on behavioural paradigms in Psych-101.⁷

2.4 Distributional distances: does the simulated group look like the real group?

Population simulation usually compares two distributions: the real answer shares and the simulated answer shares. Total variation distance is the simplest version:

\mathrm{TVD}(p,q) = \frac{1}{2}\sum_y |p(y)-q(y)|

If humans choose A/B as 63%/37% and the model gives 60%/40%, the distance is small (TVD = 0.03). If the model gives 90%/10%, it is large (TVD = 0.27). This is the right family of metrics when the claim is about a market, electorate, workforce, or user segment rather than a specific person.
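A minimal sketch of the TVD formula, applied to the A/B example above. Representing each distribution as a dictionary of answer shares is our own choice for illustration.

```python
def total_variation_distance(p, q):
    """Half the summed absolute difference in answer shares."""
    options = set(p) | set(q)  # options missing from one side count as 0
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)


human = {"A": 0.63, "B": 0.37}
close_model = {"A": 0.60, "B": 0.40}
far_model = {"A": 0.90, "B": 0.10}

print(round(total_variation_distance(human, close_model), 2))  # 0.03
print(round(total_variation_distance(human, far_model), 2))    # 0.27
```

TVD can be read as the share of respondents who would have to change their answer for the two distributions to match.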

Jensen-Shannon divergence (JSD) measures how distinguishable two distributions are by averaging the KL divergence from each one to their midpoint: 0 if identical, log 2 if they share no support:

\mathrm{JSD}(p,q) = \frac{1}{2}D_{\mathrm{KL}}(p\parallel m) + \frac{1}{2}D_{\mathrm{KL}}(q\parallel m), \quad m=\frac{1}{2}(p+q)

SimBench uses JSD and TVD as its primary distributional metrics.²
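The JSD formula above can be sketched in plain Python as follows. This is an illustration with natural logarithms and hypothetical answer shares, not SimBench's implementation.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Average KL divergence from p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


human = [0.63, 0.24, 0.13]      # real answer shares for options A, B, C
simulated = [0.50, 0.30, 0.20]  # hypothetical simulated shares

identical = jensen_shannon_divergence(human, human)            # 0 for identical inputs
disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])   # log 2 for no shared support
between = jensen_shannon_divergence(human, simulated)

print(identical, disjoint, between)
```

Because the midpoint m is positive wherever p or q is, the divergence stays finite even when one distribution puts zero mass on an option, which plain KL divergence does not guarantee.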

Wasserstein distance (also called "earth-mover's distance") is the minimum total work, calculated as probability mass times distance moved, needed to reshape one distribution into the other. This makes it useful when answer categories have an order: predicting "somewhat agree" when the human answer is "agree" is closer than predicting "strongly disagree". The Persona-Promise-and-Catch study uses Wasserstein distance to show that richer LLM-generated personas can drift away from real opinion distributions.⁸
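For ordered answer categories, the one-dimensional Wasserstein distance reduces to the summed absolute difference between the two cumulative distributions. The sketch below assumes adjacent categories are exactly one unit apart, which is a modelling choice rather than a given; the six-point scale and the distributions are hypothetical.

```python
from itertools import accumulate

SCALE = [
    "strongly disagree", "disagree", "somewhat disagree",
    "somewhat agree", "agree", "strongly agree",
]


def wasserstein_ordinal(p, q):
    """W1 distance over ordered categories, assuming unit spacing between them."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    return sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))


def point_mass(label):
    """Distribution that puts all probability on one category."""
    return [1.0 if s == label else 0.0 for s in SCALE]


human = point_mass("agree")
near_miss = point_mass("somewhat agree")
far_miss = point_mass("strongly disagree")

print(wasserstein_ordinal(human, near_miss))  # 1.0: one step away
print(wasserstein_ordinal(human, far_miss))   # 4.0: four steps away
```

TVD would score both misses identically at 1.0, because it ignores the order of the categories.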

2.5 Treatment-effect metrics: does the model predict the change?

For intervention studies, the quantity of interest is the difference between treatment and control:

\begin{aligned} \Delta_{\mathrm{human}} &= \mathbb{E}[y_{\mathrm{human}}\mid T] - \mathbb{E}[y_{\mathrm{human}}\mid C] \\ \Delta_{\mathrm{model}} &= \mathbb{E}[y_{\mathrm{model}}\mid T] - \mathbb{E}[y_{\mathrm{model}}\mid C] \end{aligned}

The simplest metric is direction match: does the model at least predict whether the effect is positive or negative? A stronger metric is treatment-effect error:

\mathrm{TE\ error} = |\Delta_{\mathrm{model}}-\Delta_{\mathrm{human}}|

Across many studies, researchers often report effect-size correlation: do the studies for which the model predicts larger effects also show larger effects in the human data? Lippert et al. report GPT-4 at r = 0.89 against realised effects, close to expert human forecasters at r = 0.87, while GPT-3.5 is near zero at r = 0.07.⁴ Agentic Economic Modeling reports calibrated treatment-effect estimates close to human field experiments after using a small human calibration sample.⁹
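The three treatment-effect metrics can be sketched together. All numbers below are invented for illustration (binary sign-up outcomes for one study, and four made-up effect sizes for the cross-study correlation); the function names are our own.

```python
import math
from statistics import mean


def treatment_effect(treated, control):
    """Difference in mean outcome between treatment and control."""
    return mean(treated) - mean(control)


def sign(x):
    return (x > 0) - (x < 0)


def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )


# One hypothetical study: 1 = signed up, 0 = did not.
delta_human = treatment_effect(treated=[1, 1, 1, 0], control=[1, 0, 0, 0])
delta_model = treatment_effect(treated=[1, 1, 0, 0], control=[0, 0, 0, 1])

direction_match = sign(delta_model) == sign(delta_human)
te_error = abs(delta_model - delta_human)

# Several hypothetical studies: one effect size per study.
human_effects = [0.05, 0.20, 0.10, 0.40]
model_effects = [0.10, 0.25, 0.05, 0.30]
effect_size_r = pearson_r(model_effects, human_effects)

print(direction_match, te_error, round(effect_size_r, 2))
```

In this toy study the model gets the direction right but predicts only half of the human effect, a gap that direction match alone would hide.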

A central warning from the literature is that descriptive fit does not imply causal fit. A model can match population answers and still fail to predict how those answers change under intervention.¹⁰

2.6 Open-text metrics

Open text is harder because there may be many acceptable answers. Researchers therefore use weaker but sometimes useful proxies.

LLM-judge alignment asks another model to judge whether the generated text expresses the same underlying state: stance, belief, emotion, intent, or communication style. HumanLM uses this approach and reports a 16.3% relative improvement over response-imitation baselines.¹¹ This is useful, but it should not be the only evidence. The judge is itself a model.

Perplexity is the exponential of the average per-token NLL: how unexpected a piece of generated text is, token by token, under a reference language model. Lower means closer to the reference. Point of Order reports a 67% perplexity reduction in person-specific civic deliberation simulations after action-aware fine-tuning.¹²
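As a small sketch, perplexity can be computed from per-token log-probabilities, taking it as the exponential of the mean per-token NLL (the usual convention, with natural logarithms). The two token probabilities are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean per-token negative log-likelihood."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical two-token answer: the reference model gave the tokens
# probabilities 0.5 and 0.25.
logprobs = [math.log(0.5), math.log(0.25)]
ppl = perplexity(logprobs)

print(round(ppl, 2))  # 2.83
```

Averaging over tokens makes answers of different lengths comparable, which a summed NLL is not.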

Cosine similarity between sentence embeddings is the cheapest open-text comparator. It works only as a secondary metric, because semantic similarity does not measure behavioural fidelity.
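The comparison itself is a one-line formula. In the sketch below the vectors are hypothetical stand-ins; in practice they would come from a sentence-embedding model, which is outside the scope of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings" of a human and a simulated reply.
human_reply = [1.0, 2.0, 3.0]
agent_reply = [2.0, 4.0, 6.0]  # same direction, different length
unrelated = [3.0, 0.0, -1.0]   # orthogonal to human_reply

print(round(cosine_similarity(human_reply, agent_reply), 2))  # 1.0
print(round(cosine_similarity(human_reply, unrelated), 2))    # 0.0
```

A high score says the two texts are about similar things; it does not say that the agent would act as the person did.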

3. Baselines and ceilings

For individual prediction, the ceiling is human test-retest reliability. If a person is only 80% consistent with themself, then no serious evaluation should treat 100% as the practical goal. Park's paper is strong partly because it reports individual-level metrics against this ceiling.¹

For population simulation, the comparison is usually a real survey, experiment, or behavioural dataset. A model should also be compared against simple baselines: majority answer, demographic-only prediction, or a non-grounded prompt. If an expensive agent barely beats a simple baseline, the result is weak even if the headline metric looks acceptable.
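A majority-answer baseline takes only a few lines to compute, which is part of why it is worth reporting. The sketch below uses invented answers for ten held-out respondents; the names and data are illustrative only.

```python
from collections import Counter


def majority_baseline_accuracy(reference_answers, test_answers):
    """Accuracy of always predicting the most common reference answer."""
    majority = Counter(reference_answers).most_common(1)[0][0]
    return sum(a == majority for a in test_answers) / len(test_answers)


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: earlier answers used to find the majority option,
# then held-out human answers and the agent's predictions for them.
reference = ["A"] * 7 + ["B"] * 3
held_out = ["A", "A", "A", "B", "A", "B", "A", "A", "B", "A"]
agent = ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A"]

baseline = majority_baseline_accuracy(reference, held_out)  # 7 of 10
agent_acc = accuracy(agent, held_out)                       # 8 of 10
lift = agent_acc - baseline

print(baseline, agent_acc, round(lift, 2))
```

An "80% accurate" agent here improves on a zero-cost baseline by ten percentage points, which is the number a reader actually needs.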

For treatment-effect forecasting, the useful comparison is expert forecasting or historical baselines. GPT-4's r = 0.89 matters because the same study reports human expert performance at r = 0.87.⁴

4. How researchers build the predictor (personas/models/agents)

Prompted personas. The model is given a demographic profile or written persona and asked to answer as that person. This is the classic "silicon sample" setup.¹³¹⁴¹⁵ It is most defensible for population distributions. Its weak point is individual fidelity: a plausible persona is not the same thing as a measured person.

Self-report-grounded agents. The model receives real material from a person: an interview, survey history, profile, or previous responses. This is the natural method for individual prediction.¹⁵ Its weak point is the gap between what people say about themselves and what they actually do.

Behavioural fine-tuning. The model is trained on human experimental traces. Centaur is the central example: a large model fine-tuned on Psych-101 and evaluated on held-out behavioural paradigms.⁷ This can be strong for population-level cognition and lab-task prediction, but it still needs contamination checks such as LogProber.¹⁶

Latent-state models. The system predicts an underlying state such as belief, preference, emotion, or intent, then generates the response from that state. HumanLM is an example.¹¹ This is attractive because surface text is noisy, but the latent state is often inferred rather than directly observed.

Calibration and synthetic control. The model produces raw synthetic responses, then a statistical layer corrects them using some real human data.⁹¹⁷¹⁸ This is often the most credible route for decision support, because it admits that the model is biased and measures the correction.

Agent-based environments. The model acts inside a structured world: a social network, game, economy, city, workplace, or meeting.¹⁹²⁰²¹²² These systems can study interactions over time. Their weak point is validation: an environment can look plausible while still failing quantitative tests.

Effect forecasting. Instead of simulating each participant, the model predicts what a real study will find.⁴²³ This can be useful, but it should not be confused with individual simulation.

5. Validation patterns in the literature

Researchers typically strengthen a predictiveness claim in three ways.

Held-out fit. The model is tested on people, items, or studies that were not used to build the predictor. This is the basic requirement. Without held-out evaluation, the result may be memorisation or prompt tuning.

Robustness. The result is checked across wording changes, subgroups, tasks, and datasets. This matters because LLM agents can be sensitive to small prompt changes. O'Leary's perturbation tests on Centaur-class models are one example of this concern.²⁴

Intervention. The strongest evidence comes when the model predicts not only a static answer, but the direction and size of a change. Causal-effect studies and calibration papers are important because they test this directly.¹⁰⁹

Centaur is notable because it goes beyond a single benchmark score: it tests generalisation across cover stories, task structures, and domains, and includes contamination checks.⁷¹⁶

6. Common failure modes

The recurring problems are simple, but serious.

No human ceiling. Raw accuracy or raw correlation is hard to interpret without test-retest reliability, expert baselines, or a noise floor.

Aggregate success, subgroup failure. A model can look good on average while being wrong for important subgroups. SimBench and OpinionQA make this visible.²³
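A subgroup breakdown is a simple way to surface this failure. The sketch below uses ten invented records with two hypothetical subgroups, "x" and "y"; nothing here is drawn from SimBench or OpinionQA.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """records: iterable of (subgroup, predicted, actual) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subgroup, predicted, actual in records:
        totals[subgroup] += 1
        hits[subgroup] += int(predicted == actual)
    return {g: hits[g] / totals[g] for g in totals}


# Hypothetical records: subgroup "x" is large and well predicted,
# subgroup "y" is small and always missed.
records = (
    [("x", "A", "A")] * 7
    + [("x", "A", "B")]
    + [("y", "A", "B")] * 2
)

overall = sum(p == a for _, p, a in records) / len(records)
per_group = accuracy_by_subgroup(records)

print(overall, per_group)
```

The aggregate score of 0.7 looks acceptable, while the smaller subgroup is predicted correctly zero times.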

Population fit mistaken for individual fit. Matching answer shares does not prove the model has predicted any real person's answer.⁵

Descriptive fit mistaken for causal fit. Matching today's distribution does not prove the model can predict tomorrow's response to an intervention.¹⁰

Prompt sensitivity and contamination. If small wording changes break the result, or if the benchmark may have leaked into training, the score is not clean evidence.²⁴¹⁶

Over-reliance on LLM judges. LLM-judge metrics can help with open text, but they should not replace human-ground-truth measures.

Drift in persistent agents. Agents that persist over time can change in ways that are hard to notice. Stable-persona work treats temporal stability as something to test, not assume.²⁵

7. Questions for reading a paper

When a paper says that an LLM agent predicts humans, the useful reading questions are:

Is the prediction about an individual, a population, or an intervention?
Is the metric matched to what is being predicted?
Is there a human ceiling or expert baseline?
Is the model compared with simple baselines?
Are subgroup results reported, not only aggregate scores?
Has prompt sensitivity been tested?
Has contamination been checked?
Is the conclusion limited to the measurement setup, or does it imply more than the evidence supports?

8. Closing

Agent predictiveness is a relationship between a model output and a specific human measurement. The strongest papers make that relationship explicit: prediction, metric, baseline, ceiling, and failure mode. Once those are visible, the literature becomes easier to read.

Bibliography

Organised by method family.