Problems in the AI Eval Political Economy

Evaluations of new AI models’ capabilities and risks are an important cornerstone of safety-focused AI policy. Currently, their future as part of the policy platform faces peril for four reasons: Entanglements with the broader AI safety ecosystem, structural incentives favouring less helpful evals, susceptibility to misinterpretation, and normalization of inaction. 


Introduction

Evaluations of AI models are a mainstay of the AI safety environment: They provide vital insights into models’ capabilities (what they can do - most relevant to how models can be misused) and propensities (what they are likely to do - most relevant to how models might misbehave on their own). Insights from evals can inform developers’ choices around deploying their models. Evals are also an important part of policy instruments: Wherever policymakers want to tie limitations and obligations to levels of capability and risk, trustworthy and reliable evals are essential. This is especially true given the inaccuracies of other measurable proxies for capability and risk: Without evals, much safety policy is a blunt instrument, prone to stalling progress and susceptible to political attack. If policy is to be a pillar of building safe AI, evals have to be a cornerstone. But they face looming political risks.

The Eval Policy Platform

A successful eval ecosystem broadly depends on two kinds of policy support: Evals as policy instruments, e.g. using eval results as a basis for product standards, release conditions, etc.; and eval ecosystem support, e.g. through grants and state contracts. This distinguishes safety-focused evals from assurance services focused on a narrower band of capabilities and risks. The latter often need neither kind of political support, because they are more closely tied to myopic commercial interest and are supported by organic market demand. 

But evals do need that support if they’re going to be valuable to policy, and so far, they’ve received at least some of it. This success is the result of genuinely impressive technical work and effective advocacy. It’s also due to evals being a relatively inoffensive policy platform: They have the air of scientific neutrality, they are in principle palatable to all political groups around AI, and as long as compliance requirements aren’t too high, they place fairly little burden on industry. As such, building policy around evals is a key area of focus for safety-oriented AI policy. I believe this policy agenda faces rougher headwinds in four areas, and I try to provide some brief ideas for the future course.

These are not comments on the technical content and usefulness of evals. Others are much more qualified to comment on those. There is a separate debate to be had about the insights that evals can provide independent of policy frameworks — this is not a comment on that debate, and not an endorsement of evals as a technical tool for AI safety.

Entanglements and Conflicting Interests

The kind of headline you don’t want to see.


First, eval organisations are caught in an odd web of incentives and associations. Currently, there’s substantial overlap between the AI safety policy ecosystem and eval orgs. That’s for good historical reasons: safety-focused funding ensures orgs can focus on the kind of horizon-focused eval work that the narrow assurance market cannot cover, and it guaranteed their survival in the early days of the advanced AI ecosystem. But that structure could prove to be a political burden because it creates visible conflicts of interest. Many prominent eval orgs are funded by the same sources that also fund ostensibly independent policy organisations pushing for evals as policy instruments - in effect advocating for their sister organisations’ ongoing funding and employment. Transitioning to a for-profit eval ecosystem is a treacherous path as well – California AI safety leader Dan Hendrycks’ parallel involvement in the drafting of SB-1047 and in the for-profit eval company Gray Swan was ultimately remedied, but arguably cost a lot of political support along the way. 

This is readily available and very effective political ammunition for opponents, and I would strongly expect future policy discussions to include safety skeptics drawing lines between policy advocates and the eval orgs that stand to benefit, e.g. via funding links, personal histories and current entanglements. If there is to be any hope that evals take a prominent role in informing and implementing policy, this vulnerability should urgently be reduced. The way forward might be a clearer separation between the technical services that eval companies provide and the political discussion around where they should be employed. In that environment, eval companies would be stricter for-profit entities with less mission association; and other stakeholders with no stakes in the eval industry, from governmental AI safety institutes to third-party civil society, would employ and thereby direct their services. This would separate ‘doing good evals’ more clearly from ‘obligating others to order evals’. 

Incentives Favour the Wrong Evals

Second, the current environment doubly incentivizes dramatic capability evals. First, findings about whatever new, crazy, scary thing a new model can do play well with the media, with founders, and on social media. This places pressure on eval organisations to prioritise evals that could lead to those results, and amplifies the parts of their work that focus on them. Propensity results are much harder to communicate in that way: Reporting and perception might, at best, differentiate between a >0% (‘model could’), a >10% (‘model can’) and a >90% (‘model will’) result, but the headline is almost as good with 1% as with 50% — there’s just very little attention left in the market for the step to more differentiated results. Second, capability findings play well for developers – for instance, it seems obvious that OpenAI’s claims to be on track to AGI benefit from eval results confirming concerning degrees of scheming and planning ability; or that cybersecurity or biosecurity risks implicitly point at respective capabilities. Developers can boost these findings, e.g. through how they distribute and provide eval access to their models, what they feature on their system cards, etc. So far, they have very effectively played on the edge of allowing, using, and boosting somewhat scary capability results to their own advantage. This is a risk because it invites slowly getting used to increasingly alarming eval results, as discussed further below; and because it places evals under suspicion of collaboration. The more eval results appear as part of a carefully calibrated PR system of frontier labs that draws on AI safety worries, the less they’ll enjoy unequivocal support as policy instruments — and the supposition of such a hype-generating safety ecosystem is neither new nor rare.

All of this is not entirely the fault of eval organisations. Much has to do with media reception, political instrumentalisation and commercial interest. But they’re still the ones who can most readily change this political economy for the better.

Evals will be in a weird spot between burden and boost to model releases until a more structured policy framework around evals is introduced – one that hinges less on the voluntary decisions of developers. Until then, the production and proliferation of striking capability results can boost the development of potentially unsafe models and weaken evals’ strategic position down the road. It might also come at the expense of potentially interesting academic propensity work or narrower and more deterministic harm vector examinations (others have discussed this at much greater length and in more detail). This is all unfortunate on its own, but it becomes a big liability in conjunction with the next point:

The Rabbit-Duck Problem


Left: Evidence that models will soon exfiltrate and take over. Right: Evidence that someone has asked a model to mimic a rogue AI.


Third, many eval results invite drastically divergent interpretations of the same findings - a ‘rabbit-duck problem’. Statements of the type ‘a model does x under conditions y’ reliably run into the same reception: Whoever was worried about the model ‘doing x’ sees the eval as confirmation that the model can indeed do x - a dangerous capability increase! But that view is often mistaken: The possibility of eliciting dangerous model capabilities is often simply a corollary of general and unconcerning capability increases, whereas the concerning element lies in the propensity or system-level accessibility of these capabilities. Whoever was not too worried about these capabilities, on the other hand, will focus on condition y - of course the model demonstrated a dangerous capability, it was prompted to do so! That interpretation is likewise sometimes a bit simplistic – AI models often get prompted to do things in the wild as well, and we presumably do not want to widely deploy models that are one bad prompt away from dangerous scheming and self-exfiltration. 

But as long as this rabbit-duck problem persists, evals are of limited use for informing policy. For evals as a criterion for applying restrictions, this much room for interpretation makes agreeing on concrete thresholds less likely: For instance, imagine trying to define a scheming threshold beyond which a model should not be deployed or gain access to the internet. With equal conviction, safety advocates will argue that this threshold is reached whenever the scheming can be elicited, and regulation critics will argue that it has only been reached through toy examples. For evals as a broader contribution to policy discussion, the vagueness makes for bad evidence; reliably, eval results that allow for such different interpretations will be pointed at by some and dismissed by others. 

To support friends of evals in upcoming policy discussions, a shift in the content of ‘headline evals’ that make it into abstracts, media outlets, system cards and press releases would be very helpful. Loss of control and unreliability might be better covered with a stronger focus on propensity assessments, which leave less room for the ‘you asked the model to misbehave’ objection. Misuse might be better covered not just with an independent capability and propensity note, but also with a more direct connection to counterfactual impact (e.g. a greater focus on explaining the possible harm arising from a model’s likely help in cyberattacks). To be clear: This is a political wishlist, and it might not work too well with the technical agendas and limitations of eval organisations. But it might be useful to keep in mind in charting the future course.  

Normalizing Inaction

  • The more evals you do that don’t have any consequence for deployment, the less impactful future evals will be.

Fourth, focusing on eval capacity buildup and anchoring evals in the default release process without active progress toward binding and operationalized structures can be harmful: the current environment is getting everyone used to not reacting to eval results. Currently, no jurisdiction in the world requires new models to clear specific eval results before release. Voluntary commitments by developers are also often vague and rarely genuinely constraining. As a result, model releases are increasingly often accompanied by potentially fairly concerning eval results – which have absolutely no tangible impact on the models’ deployment. Because eval organisations are aware of their limited say over deployment decisions, there are rarely ever concrete recommendations on whether and how to deploy the model; the eval results are simply provided, and the deployment runs its course.

This erodes norms of appropriate reactions to evals. It is part of a broader issue plaguing AI safety, but it becomes especially obvious here. There are two plausible ways to read what’s going on with evals today:

  • Either eval results are indicative of seriously dangerous current capabilities (as some reactions would suggest). In that case, it would be a veritable scandal that models are still being developed and deployed without oversight. Eval organisations should then be much more aggressive about recommending that development and deployment cease immediately, communicating their concerns and objections to the public and policymakers, etc — both to protect from AI risks and to protect their credibility. (I don’t believe this is true.)

  • Or eval results are indicative of potentially noteworthy trajectories, and will only become informative regarding current risks sometime later. If that is true, current AI model releases are acceptable from the safetyist view. In that case, the drama around their ‘risky’ capabilities is way overblown and provides plenty of reasons to discount future, more genuinely alarming results. The entire safety ecosystem would then do very well to be very cautious about campaigning on these sorts of findings, or risk being accused of having cried wolf yet again.

In either case, the current combination of brakes-off deployment supported by dramatically reported-upon capability results will cost credibility.

This dynamic risks providing future justification for ignoring evals. There are easy-to-draw, hard-to-debate (false) equivalences. For instance, imagine a policy discussion around deferring to concerning evals for deployment decisions; and imagine that o1 causes no substantial damage until then. Regulation skeptics will be quick to point out that eval orgs had been skeptical of o1 too, but it was released anyway and caused no harm, so their opinion should not be given much weight. 

Both these issues would be much more manageable if evals culminated more visibly in tangible recommendations ranging from ‘don’t release this at all’ to ‘release this with no limitations’. Similarly, they might culminate in a clear assessment of whether the eval is relevant for assessing the model or the trajectory, i.e. distinguishing between ‘we think this eval is relevant to assessing this current model’ and ‘we think this eval is relevant to assessing technological progress toward future models’. That recommendation doesn’t necessarily have to come from the eval orgs themselves, but could follow some result aggregation mechanism or neutral synthesis – lots of options come to mind. The important part is that there is a clear way in which external parties can assess whether anyone has in fact ‘cried wolf’ for a given model. This makes the wolf-crying distinct and actionable, and it reduces the risk of alarmism allegations. Understandably, this is tricky terrain for eval organisations, as it brushes against the edges of their mandate and requires them, or someone, to make broader normative determinations beyond empirical observation. But as long as evals don’t produce tangible recommendation classes, the chances that they’ll continue to be impactful drop with every instance of vague inaction.

This is also why the impressive inroads made by the evals field, from increased funding and technical skills of eval orgs to increased media attention and deeper integration into model release pipelines, are not necessarily progress all things considered. When evals become a cornerstone of a currently somewhat defective policy ecosystem, their credibility as a future basis for fundamentally different frameworks suffers. More uptake of evals in the current situation might make things worse.

Is AI safety going to be alright?

...

Yeah, we just need to do more evals.

...


Conclusion

Evals are an important cornerstone of designing safety-focused AI policy without undue burdens or imprecise instruments. But their future as a policy instrument faces peril: The current organisational structure of eval organisations is politically vulnerable, incentives point evals towards sensational results susceptible to misinterpretation, and their unclear current role provides precedent for ignoring them. Changes toward a clearer division of labor – with eval organisations free of entangled policy interests – and stronger mechanisms tying evals to tangible recommendations and predictions would be a very welcome setup for the policy discussions ahead.


I try to post updates on my writing, but not on much else, on X / Twitter.

