Alignment & Succession: The Two Bars of Alignment
Why much AI alignment discourse automatically treats the transition to powerful AI as a succession problem
When people talk about AI being “aligned”, I think there’s a lot of conflation between two different bars of success:
The AI does what you say and does not go rogue. If you ask it to do a machine learning experiment for you, it actually does that instead of scheming against you and escaping onto the internet. If you say “stop”, it stops. (Some AI properties considered important for achieving this bar are corrigibility, non-deceptiveness, and intent alignment.)
The AI has fully internalized our values, and could run society, and the lives of the humans in it, in a way that we would judge as good. You just ask it to build the ideal society, and it figures out what utopia is and builds that for you, and it really is utopia. (Some concepts related to this: value alignment of the AI, the AI achieving humanity’s coherent extrapolated volition)
The first might sound like a very low bar, and it is. It’s also a standard we’re familiar with from other technologies. With nuclear reactors or airplanes, we ask the question “will it blow up in our faces?”. Just like it’s possible to build a nuclear weapon, it’s possible to build an AI that blows up in our faces very badly. The difference with AI is that there’s a case for it being much more powerful, and our understanding of whether it blows up in our faces is much weaker than that for nukes. With nukes, after all, we have nuclear physics that works very well. With AI, we have long theoretical arguments about what superintelligence would look like, whether it’s possible, and whether it is extremely likely or extremely unlikely to blow up in our faces.
The second might sound like a very high bar, and it is. It’s also a bar that we’ve not previously had to think about when talking about technology. If we’ve thought about it, it’s been in the context of arguing about culture, politics, or institutions.
Now you might say: what do you mean the AI is meant to run our entire society and all of our lives? Isn’t that a ridiculous strawman of what people want AI to do? And perhaps it would be good if it was, but it’s a common view among people who expect AGI soon (and especially, perhaps, the AI safety community). On LessWrong, my posts have previously encountered opposition to the idea that it’s worth doing any planning or design of the post-singularity society. Why would we, with our weak fallible human minds, waste our time designing paradise, when soon we will have a god—sorry, I mean a superintelligence—that does it for us? A common view I’ve heard is that all that matters is that we handle the transition period, and build aligned superintelligence, and then we delegate all of our decisions to it.
Will the superintelligence implement democracy? Will it rely on markets? These are just earthly institutions, the view goes, that we humans have built as kludges. Probably the superintelligence will transcend all such things.
Does this mean a tyranny where the superintelligence micromanages everyone’s lives? Well, it’s superintelligent. If that is good for us, that’s what it’ll do. But to the extent that we value choice, it will give us choice. So no need to worry about tyranny!
What is the nature of the good life? The superintelligence will solve this question. After all, it’s superintelligent! Probably—some think—this looks like the superintelligence discovering some optimum configuration of matter, and then tiling the universe with that. But if that’s not the case, don’t worry, the superintelligence will have figured it out and it will just go out and do the right thing. If you want to know the nature of the good life, you just have to ask it, and it will tell you.
In his review of Peter Singer’s commentary on Marx, Scott Alexander writes:
[...] Marx was philosophically opposed, as a matter of principle, to any planning about the structure of communist governments or economies. He would come out and say it was irresponsible to talk about how communist governments and economies will work. He believed it was a scientific law, analogous to the laws of physics, that once capitalism was removed, a perfect communist government would form of its own accord. There might be some very light planning, a couple of discussions, but these would just be epiphenomena of the governing historical laws working themselves out.
I think this attitude is ridiculous, for reasons I will soon get into. In particular, what’s weird is that the AI not blowing up in our faces is treated as approximately of the same difficulty as the AI knowing everything about what is good and right. But if you look at where it came from, this makes sense—given a very particular set of assumptions.
Like much else in AGI discourse, it all comes back to Eliezer Yudkowsky. Yudkowsky’s central case for doom comes from a combination of two points:
The fragility of value. It’s hard to pin down what human value is. Any simple metric, maximized ruthlessly, will destroy everything we care about.
The coherence of superintelligence. It is the nature of intelligence to be an expected utility maximizer. Anything smart enough will tend towards optimizing some utility function ruthlessly.
Therefore, argues Yudkowsky: aligning superintelligence is very hard, because unless you very precisely capture all of human value—or at least precisely capture notions like corrigibility that are deeply unnatural to express in clear math to an expected utility maximizer—it will tile the universe with something slightly wrong, and, because value is fragile, all value will be destroyed. (See here for more related discussion)
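The fragility claim above can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than Yudkowsky’s own formalization: the four goods, the budget of 12 units, and the choice to model “true” value as a product (so that neglecting any single good zeroes out the whole) are hypothetical stand-ins. It simply compares an optimizer that maximizes a one-dimensional proxy against one that maximizes the assumed true value.

```python
from itertools import product
from math import prod

# Hypothetical goods and budget; both are illustrative assumptions.
GOODS = ("health", "friendship", "freedom", "art")
BUDGET = 12  # whole units of effort to split across the goods


def true_value(allocation):
    """Assumed 'fragile' value: the product of all goods.

    Neglecting any single good drives the total to zero.
    """
    return prod(allocation)


def proxy_metric(allocation):
    """A simple metric that only sees the first good (health)."""
    return allocation[0]


def all_allocations(budget, n):
    """Every way to split `budget` whole units across `n` goods."""
    return [a for a in product(range(budget + 1), repeat=n) if sum(a) == budget]


candidates = all_allocations(BUDGET, len(GOODS))

# The "ruthless" optimizer only cares about the proxy.
ruthless = max(candidates, key=proxy_metric)
# The "careful" optimizer has access to the assumed true value.
careful = max(candidates, key=true_value)

print(dict(zip(GOODS, ruthless)), "true value:", true_value(ruthless))
print(dict(zip(GOODS, careful)), "true value:", true_value(careful))
```

Under these assumptions, the proxy maximizer pours all 12 units into health and ends up with a true value of 0, while the balanced split of 3 units each scores 81. The toy says nothing about how real values are structured; it only shows how quickly a slightly-wrong objective can discard everything it does not measure.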
Note that under this view, superintelligence not blowing up in our face and superintelligence perfectly knowing the good life are of nearly equal difficulty, because whatever the AI thinks is the good thing, it will convert the entire universe into that. We must surrender the world and ourselves entirely to the superintelligence, because the superintelligence will accept nothing else—not out of malice, but just because that sort of totalizing utility maximization is just what “superintelligence” physically means, just as “a super-hot temperature” means atoms must be moving very fast. And therefore the superintelligence must be morally perfect—or at least, capable of becoming morally perfect without any further human input—because otherwise it will obviously destroy everything.
If you want to feel the alignment difficulty that the most hardcore “doomers” feel, you only have to consider that they think the checklist looks like this:
1. Solve moral philosophy, finally & forever
2. Impart the correct moral philosophy to the AIs
(These days, it’s common to insert a step 0: build a very-capable-but-not-quite-too-dangerous AI, to then achieve step 1 for you.)
There are a lot of technical reasons why step 2 is difficult, especially given the poor match between the great subtlety in human values and our crude methods for steering machines. However, even if there were exactly zero technical difficulties with step 2, the plot would already be lost—step 1 is the problem!
Succession is hard, value is fragile
Whenever power switches hands, you have a succession problem.
When I say “succession”, I mean a transfer of power from one entity to another. The usual - and preferred - state is that you have power, which you use to manage whatever it is you manage (your life, your team, a country, a company, etc.) and you keep that power. Sometimes, however, you are forced to hand off. The dying need someone else to take over their affairs. You might change jobs or be promoted and hand off your team to be managed by someone else. Your country needs a way to transfer executive power, either because the current executive died or because frequent executive transfers were enshrined as an anti-power-concentration measure in a democracy’s constitution.
Succession problems are hard. They doom dictatorships, make every transfer of the CEO role existential for a company, and make older generations complain about the moral degeneracy of the youth.
In general, you should minimize the amount of succession you need to do, over the things whose management you actually care about, when what you want to achieve with the thing is not narrow and simple. You care about your life, so you never hand over control unless you’re literally incapacitated and dying. If you care about a company or its mission, you try to retain control.
Superintelligence, by assumption, dooms humanity to powerlessness, since the superintelligence is assumed to hold all power itself. And so the default discourse in AI safety, and in AGI-conscious AI governance, is fundamentally about succession. The hope is not that our current world order survives, or that humans continue to wield power. The hope is that, even though all existing governments and organizations and institutions and norms are swept aside, we’ll start with a clean slate that is good, designed by a benevolent superintelligence.
This is the worst succession problem of all time.
One reaction you can have towards this is that humans should not be disempowered. Steering technology and policy towards that may seem hard, but is also criminally neglected – I’ve written about options before and tried to do something about it.
Another reaction is to ban everything related. This would require very high levels of international coordination, and this coordination is made harder by the very wide distribution of beliefs over whether superintelligence will be good or bad, or controllable or uncontrolled. There’s also a genuinely hard tradeoff between power and welfare, if we take seriously both the concerns about succession or disempowerment and the expected welfare effects of radical AI (e.g. curing cancer, eradicating death, augmenting intelligence). Historically, your disempowerment has been quite bad for your chances of getting anything, yet from today’s vantage point we often endorse many of the moments when our ancestors gave up power in ways that increased growth (e.g. giving up political control by joining a larger polity, or accepting the extinction of jobs or offices; Tyler Cowen for example has an aggressive general argument in this vein). Then again, our ancestors, by virtue of having us as descendants, weren’t the unlucky ones.
A third reaction is: we’ve just been handed an assignment. The reality of the situation is that we just have to figure out for once and for all what we want, and therefore the assignment we must pass is successfully codifying that. Now we’re good students, so of course we’ll do our homework. How hard can it be, after all? We’ve spent all our lives studying a wide variety of subjects—all the way from mathematics to programming to even physics—and we’ve learned that the way you solve every problem is that there are some ground-truth logical principles, and you do some analytical reasoning to understand and apply them, and then you get a generalizing solution. Now yes, this problem is a tricky one, but hey—we have at least three years to solve it! (Oh wait, did Anthropic just drop a new model? Maybe two years. Better work harder!)
The problem with this is that “what we want” or “what we prefer” or “what is the nature of the good” are not homework problems with nice clean generalizing solutions that you can write down once in closed form and then walk away from.
Alignment thinkers worry about succession
All the actually serious thinkers recognize this problem and are deeply afraid. Paul Christiano, an alignment researcher with one of the best track records in the field for thinking clearly about topics way before they become mainstream, said in his interview with Dwarkesh Patel:
Paul Christiano: [... One goal we could have is that] we want to pass off the world to the next generation of machines [...] [In the same way that t]here are some people, we like them, we think they’re smarter than us and better than us. [...] I think that’s [...] a huge decision for humanity to make. And I think most humans are not at all anywhere close to thinking that’s what they want to do. If you’re in a world where most humans are like, “I’m up for it, the AI should replace us, the future is for the machines”—then I think that’s a legitimate position that I think is really complicated, and I wouldn’t want to push go on that, but that’s just not where people are at.
I do not right now want to just take some random AI, [and] be like, yeah, GPT-5 looks pretty smart [...] let’s hand off the world to it. And it was just some random system shaped by web text and what was good for making money. And it was not a thoughtful [moment where] we are determining the fate of the universe and what our children will be like. It was just some random people at OpenAI [who] made some random engineering decisions with no idea what they were doing. Even if you really want to hand off the world to the machines, that’s just not how you’d want to do it.
Dwarkesh Patel: I’m tempted to ask you what the system would look like where you’d think: yeah, I’m happy with what I think, this is more thoughtful than human civilization as a whole, I think what it would do would be more creative and beautiful and lead to better goodness in general. But I feel like your answer is probably going to be that I just want this society to reflect on it for a while.
Paul Christiano: [...] I’m just not really super ready for it. I think when you’re comparing to humans, most of the goodness of humans comes from this option value [of people thinking for a long time]. And I do think I like humans more now than 500 years ago, and I like them more 500 years ago than 5000 years before that. So I’m pretty excited about there’s some kind of trajectory that doesn’t involve crazy dramatic changes, but involves a series of incremental changes that I like. And so to the extent we’re building AI [...] I want to preserve that kind of gradual growth and development into the future.
Or consider Eliezer Yudkowsky, the arch-prophet of AI doom. Infused in all of Yudkowsky’s writing—when it gets to moral questions about what ought to happen, but not when it’s about what superintelligences will be like—is a deep respect for the complexity and fragility of human value. The single statement he called out as the pinnacle of his writings is:
Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth.
In the same piece, he writes:
If the chain of inheritance from human (meta)morals is broken, the Future does not look like [an unimaginably strange galactic civilization that is still wonderful]. It does not end up magically, delightfully incomprehensible.
With very high probability, it ends up looking dull. Pointless. Something whose loss you wouldn’t mourn.
He even waxes lyrical, very reverently, about the long chains of well-handled subtle succession that have given us our values, and about them hopefully stretching unbroken into the distant future.
Or consider Will MacAskill and Toby Ord, some of the originators of the effective altruist movement. As moral philosophers, they’re keenly aware of the complexity of human value. But as academic moral philosophers, they have a very academic solution to it: the “long reflection”, a period of thinking and reflecting as a civilization, after we solve that pesky issue where the AIs (or bioweapons or nukes or whatever) might kill all of us, but before we make the final decision about what we do with the universe. If this sounds hard, don’t worry—they’ve budgeted up to tens of thousands of years for the process.
There’s lots that’s wrong about the idea of the long reflection, and the morality-as-a-homework-problem vibe behind it. Robin Hanson has called it a “crazy bad idea”. Felix Stocker has an excellent analysis of the underlying cultural currents here.
But here’s what the concept of the long reflection does tell you: when academic moral philosophers are forced to consider the problem “what thinking do we need to do before we’re happy making our final choices about what everything should be like”, they suddenly get cold feet. As academic philosophers, presumably they’ve been tortured countless times by debates at seminars running hours over schedule—and yet here they are, bravely telling us that civilization might need to spend tens of thousands of years debating such things before we should be confident enough to lock it in.
Actually, never mind any of that. Yudkowsky writes that seeing the fragility of value discussed above as obvious requires an “immense amount of background explanation”, the rehashing of which would “take us [...] through 75% of [his] Overcoming Bias posts” (hopefully we don’t have to do that during the long reflection, or we’d run out of time!). But ... isn’t this the most obvious point in the world? If you’ve ever struggled with an important decision, you know how non-obvious the right thing can be. If you’ve ever been to a party with bad vibes, you understand how even very subtle things can make an experience much worse. If you’ve ever noticed yourself still getting better at understanding your partner’s preferences months or years into a relationship, you know how much depth there is to what a human likes. If you’ve paid attention to your own values and preferences over time, you’ve realized how they change in unpredictable ways, based on reflection, on external events, and on a hard-to-disentangle mix of the two.
Value is fragile. Succession is hard to do right.
Why?
Personnel as policy, and why succession is hard
Can we get any more insight into why succession is so hard, besides general claims about the fragility of value?
It is often said: personnel is policy.
What does it look like to not understand that?
Imagine a CEO trying to find a successor. “I’m about to hand away all my power”, they think. “So I better think carefully here: I should specify all the things I care about and how I make value judgements, and then just hand that document over to my successor.”
Even assuming that the successor actually follows this document, intuitively we don’t expect it to go very well. What value judgements to make, what things to care about, how to make tradeoffs—these are not easy to specify in explicit language. Now imagine how much worse it is if, instead of the successor being a human, the successor is something we don’t have much in the way of either theory or data on.
Now: there’s a lot of talk about corrigibility in AI alignment. What that problem looks like in the above situation: the CEO hands over power, but the successor promises to listen to corrections from the ex-CEO. One day the successor makes a call that the ex-CEO disagrees with, so the successor gets a call with some feedback and a request to reconsider. The successor says: “Sure, you’d have handled it that way, but now I’m in charge and I don’t need to listen to you anymore”. This represents the problem—corrigibility is not a natural property, either of abstracted agents or your average in-the-flesh human CEO. Of course, every good governance system is all about managing corrigibility; in democracies the people can force the government to listen to them by being able to fire the current government, for example. But corrigibility is hard and messy to implement, and whoever is currently in power always has incentive to chip away at it.
What does successful succession look like? Remember: personnel is policy. What you’re picking is not someone to execute an algorithm or maximize a metric you leave behind, but someone who carries enough of the right stuff in their head to orient towards the right things. You want to find someone who is attuned to what you want, who has the same intuitions about good and bad, proper and improper, reckless and timid. Ideally you know this person deeply. If you care about the outcomes—if, as the CEO, the company is not just an asset you own but something you’ve invested your life in—you’re probably willing to take fairly sharp trades that trade raw intelligence for “alignment”, in this deep and holistic sense. So what if they’re a genius, if they’re just going to make costs come down while slowly draining the company of its culture and mission? And, of course, it helps a tremendous amount if it’s a human, since then so much about intuitions and values and desires clicks by default. We’ve literally never dealt with succession to non-human-containing entities before, over anything important! Consider how often succession has gone wrong in normal human institutions without anything like such a distribution shift!
The general case is this: achieving what you want is, far more than you’d think, a process of trial & error that requires adjustment & learning & change, rather than a straight-shot to a goal. We understand less than we think about what we want and how to achieve it.
Why is personnel policy? Because in the course of carrying out a job, even if handed a spec, you will have to fill in a lot. And how you fill it in depends on who you are.
Reasons to rush to succession
So: by default, try very hard to avoid getting into a succession problem. If you get into one, you are already having a bad day.
So why are so many in AI, even when well-intentioned, racing towards solving succession problems? Because they fear that we have to: we’ll create AIs so vastly superintelligent and capable that we can’t do anything but let them take over. The reasons behind that fear differ:
The original Yudkowskian coherence argument: the AIs will naturally be goal-seeking expected utility maximizers, and not very corrigible since corrigibility is very hard. The stance towards succession here is that it is deeply unfortunate and the thing that makes the whole problem hard, but (unless we solve corrigibility, which seems impossible) we must succeed so hard we’d be happy with control succession.
Competitiveness: maybe we want to keep humans in the loop, but those people over there will let the AIs run their lives/companies/governments, and thus accumulate power and resources faster than us (by the assumption that the AIs will be smarter and thus better at accumulating power and resources). Therefore, we’re stuck in a trap where the level of succession inevitably ramps up. This is deeply regrettable but just how the world will work.
Value left on the table: we shouldn’t want to keep humans in the loop too much, even though humans should still set the goals and remain beneficiaries of whatever is going on in the world, since then we won’t realize all the potential gains of having AIs run everything, and optimize our world far better than we could. There is a fuzzy line between this and control successionism (which I defined here).
Full successionism: the succession and handover is the goal. Trying to have humans around setting goals is a problem, and the more enlightened option is to defer to the AIs, which have a greater right than us to exist, flourish, and determine what the universe looks like. This is at least aggressive control successionism, and often experiential successionism (defined here).
The first two positions treat successionism as deeply regrettable. The third ranges from seeing control successionism as outright good (often combined with pessimism about humans) to just something that should be selectively applied, which in turn often shades into very reasonable AI-enabled institutional reforms.
There is therefore a line from the original Yudkowskian ideas about alignment to successionist thinking, and both share a certain resignedness to succession, but along the path the attitude towards successionism flips from regrettable to good. (Though note that alignment proposals like Russell’s assistance games, Drexler’s services, or Beren Millidge’s modular mind merging avoid succession flavor, and much prosaic alignment work is mostly about achieving the first lower bar without thought to the second bar).
Successionism rests on assumptions about what is valuable. Against successionist claims about how value works, in the next post I argue instead for a stance that might be labelled “value mysticism”, and explain how this more uncertainty-acknowledging, experience-grounded vision of what value is implies, regarding successionism, that you really shouldn’t do it. Successionism is silly because it assumes away a lot of the philosophical and empirical complexity of value. It is like replacing a car engine with a loudspeaker that makes vroom vroom noises. It is straightforwardly opposed to many everyday moral intuitions, and moral intuitions matter.
The silliness of successionism does not mean that humanity won’t or shouldn’t change radically. But, as often in life, there are more and less silly ways to do something. Humanity, humility, and liberty are good leading lights in that project. How to go about that is what I’ll talk about in the last post in this sequence.
Thanks to Xavi Costafreda-Fu, Aniket Chakravorty, @softminus, Elsie Jang, Yudhi Kumar, Luke Drago, and Arthur Conmy for feedback.


