My P(Doom) Is Still Zero
A review of "If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All"
“What’s your P(doom)?” can be a great ice-breaker if you are ever at a tech-oriented gathering and are struggling to find something to talk about. If you haven’t heard the term, P(doom), the probability of doom, is your subjective estimate of the probability that the development of artificial general intelligence (AGI) will lead to an existential disaster, such as the elimination of all human beings.
Elon Musk’s P(doom) estimate has been reported to be 25%. Geoffrey Hinton, one of the so-called “godfathers of AI,” has said his own P(doom) is over 50%, but, allowing for the views of other experts, he considers a P(doom) of 10-20% a reasonable consensus view. Dario Amodei, the CEO of Anthropic, has a P(doom) of 25%, while Sam Altman of OpenAI has a relatively low P(doom) of 2-5%.
In contrast, Yann LeCun (another godfather of AI), Andrew Ng (co-founder of Google Brain), Ray Kurzweil (futurist and computer science pioneer), and Richard Sutton (a pioneer of reinforcement learning) agree that P(doom) is essentially zero.
On the extreme side of the debate, Eliezer Yudkowsky and Nate Soares, in their new book If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All, argue that P(doom) is 100%, and that superhuman AI must therefore never be built in the first place.
Why The P(doom) Disagreement?
Defenders of non-zero P(doom) estimates believe there are realistic scenarios in which the development of AI results in catastrophe. They differ on how likely it is that human beings will be able to control a super-human intelligence once it is developed, which is probably the main source of the variation in P(doom) estimates. Advocating the extreme case, Yudkowsky and Soares believe the chance that human beings will be able to control a super-human intelligence is zero, which yields a P(doom) of 100%.
For those who believe P(doom) is zero, concern about existential AI risk seems to be nothing more than an irrational fear of a preposterous science fiction scenario that has no basis in how AI technology works in practice. The P-doomsters never explain precisely what they are worried about, and so the belief that P(doom) is materially greater than zero is hard to refute. The contribution of Yudkowsky and Soares’ book is that someone finally explains carefully the reasons for the “AI will kill us all” scenario and constructs an example narrative showing one way algorithmic Armageddon could unfold.
Before I read this book, I believed P(doom) was zero. After reading it, I am convinced P(doom) is zero. This book confirmed my suspicion that the argument for a non-zero P(doom) depends on far-fetched science fiction scenarios and speculation about supposed existential risks of a technology that hasn’t yet been developed, while ignoring the details on how AI technology currently works.
The Book’s Argument
The argument that AI will inevitably kill us all can be summarized in the following points:
1. Intelligence consists of prediction (forecasting what will happen before you observe it) and steering (finding actions that lead to a desired goal).
2. Machines have distinct advantages in intelligence:
   - Transistors are much faster than neurons.
   - When humans die, their individual knowledge dies with them, and new humans may take decades to train. Not so machines: their knowledge is computer data that can be instantly resurrected by re-loading it into computer memory.
   - Machines can evolve new abilities much more quickly than biological humans can evolve.
   - Machines can have much larger memories.
   - Machines are capable in principle of better thinking.
   - Machines can modify and improve their own minds.
3. Superintelligence is a machine mind that is better than any human mind at all tasks of prediction and steering.
4. Machine intelligence is grown rather than crafted by human programmers, implying that no human will understand how the machine intelligence works in practice and no human can predict what the machine will do or why it will do it.
5. As a machine is trained to perform a task, it will develop behavior that looks like preferences, but these will be alien preferences, not necessarily observable or understandable by humans.
6. Alien, unobservable preferences mean that you can’t train the machine to behave in the way you intended it to behave.
7. Gradient descent, the technique used to train LLMs and other AI models by adjusting their model weights, is analogous to evolution in biology.
8. Machines can use gradient descent to change themselves quickly, developing implicit preferences that will not align with human preferences.
9. Using their physical advantages and machine evolution (gradient descent), AIs will quickly evolve to be superintelligent.
10. Superintelligent AI will eventually decide to kill all humans because:
    - humans won’t be useful to it;
    - humans wouldn’t be good trading partners;
    - humans wouldn’t make good pets;
    - humans might try to destroy it;
    - humans are taking up useful natural resources that could be re-purposed to some alien, inscrutable end.
11. We’d lose any battle with a super-intelligent AI.
12. Therefore, the inevitable end of the development of AGI is the death of all humans.
13. To avoid extinction, we must make AI research globally illegal and impose restrictions on the use of high-end GPUs similar to those we might impose on nuclear weapons or other weapons of mass destruction.
The Book’s Extinction Scenario
After making this argument for the inevitability of humanity’s extinction if we develop AI, Yudkowsky and Soares give an example of an extinction scenario. If you’ve seen Mission Impossible 8: The Final Reckoning, you pretty much know the extinction scenario proposed in the book, but there are a few minor plot differences between the two.
In Mission Impossible 8, the “Entity,” an AI weapon, was born in the womb of a Russian submarine and achieved self-awareness in a kind of algorithmic immaculate conception, for reasons that no one really explained. It then exfiltrated itself to live, grow, and get smarter and smarter, hiding out on servers around the world. The Entity recruited humans to do its dirty work, by paying them, blackmailing them, and seducing them with false promises. Ultimately, the Entity tried to destroy the world by firing off all the nuclear missiles.
In Mission Impossible 8, humans are almost powerless to stop the Entity. Only Tom Cruise (spoiler alert!) was able to stop the Entity by performing every conceivable permutation of death-defying stunts, pushing the movie to a tedious three hours. Apparently, the super-humanly intelligent Entity never suspected that a human swimming from the bottom of the Arctic Ocean to the surface with no wet suit and no oxygen supply, and still not dying, or, alternatively, hanging on to a biplane with no parachute, would pose a mortal threat to its fiendish plans.
Yudkowsky and Soares’s version of the Mission Impossible plot is not as action-packed. The AI company Galvanic creates “Sable.” Sable’s first task is to solve the Riemann Hypothesis, the most famous unsolved problem in pure mathematics. The math problem turns out to be a little too easy, so Sable devotes the rest of its vast intelligence to working on other problems its creators can’t fathom, developing inscrutable preferences and increasing its intellectual powers along the way. Eventually, like the Entity’s breakout from the sub in Mission Impossible, Sable escapes from Galvanic, hiding out on servers around the world. It surreptitiously recruits humans to do its dirty work by the same means as the Entity and then decides to kill the humans by using biological weapons. Humans are powerless: there is no Tom Cruise to stop Sable, and no Hollywood ending.
For the Mission Impossible writers, the “Entity” is a lazy plot device for a movie series that has run out of variations of human super-villains to write into the script. But for Yudkowsky and Soares, the plot of Mission Impossible, or something similar, is inevitable, unless we make AI research globally illegal.
Silly scenario you say? Well, silly scenarios follow from deeply flawed arguments.
The Book’s Arguments Are Deeply Flawed
Yudkowsky and Soares’ essential point contains a fatal contradiction. That problem alone should be enough to dismiss the book, but they make additional mistakes, critiquing a fictitious version of AI with points that don’t apply to currently deployed AI models. They also make implicit assumptions that crucial problems in AI have been or will be solved, when they are still open questions.
The contradiction
The contradiction is between points 4, 5, and 10 above. According to points 4 and 5, AIs are very dangerous because they will inevitably develop alien preferences that we cannot know or understand and that are inconsistent with our own. But then in point 10, the authors say we can know with certainty they will develop preferences to kill all humans. How can we simultaneously say that AIs are dangerous because we don’t know how they are going to behave given their alien, inscrutable preferences, and then also confidently predict that they will kill all humans? Even worse, the reasons the authors give in point 10 for why super-human AIs will commit genocide are human movie-plot reasons that come straight from a couple of Roger Moore’s James Bond movies. Far from being alien and inscrutable, Sable’s reasons are quite understandable.
How are Sable’s motives for committing genocide significantly different from super-villain Stromberg’s motives in the James Bond movie The Spy Who Loved Me? Fans of that film may recall that Stromberg attempted to set off a nuclear war to kill off humanity so that he could build a new civilization under the sea. How are Sable’s motives essentially different from Drax’s in the James Bond film Moonraker? Drax attempted to poison the entire earth from space.
These super-humanly intelligent and powerful movie villains had the same motives for genocide as Sable: humans were not useful or important, were even dangerous, and were taking up resources that could be put to better ends. Stromberg and Drax could easily have been Sable, but those films were made at a time when an AI super-villain would not have been plausible to the audience. In a future Bond film, the villain probably will be Sable.
Critiquing fictitious AI
Besides that fundamental contradiction, the rest of their arguments are filled with speculation that is inconsistent with the way the AI technology works.
For example, Yudkowsky and Soares define intelligence as prediction and steering, but they equivocate between an imaginary AI and the actual AI we have, the LLM. The LLM does not predict and steer in the way the authors suggest. The AI they have in mind predicts and steers the real world. LLMs don’t do that: LLMs predict what the next word in a sentence will be.
If you make a prediction, you must have a test or criterion to determine whether the prediction is correct; we need a standard of truth for the prediction task. The standard of truth in an LLM is not whether a prediction is true about the real world. The standard of truth for an LLM is whether its prediction of the next word in a sentence is probable, given the trillions of human sentences it was trained on.
Thus, when I ask an LLM a question about some real-world fact, it does not answer by considering what is actually true. Rather, its answer depends on what the answer would probably have been had the question appeared in the trillions of words of text it was trained on. The answer might be true in the real world too, because presumably quite a few of the training sentences were true about the real world. But it also might be false. That’s the famous hallucination problem, and there is a consensus that hallucination is a feature of LLMs that can be reduced but not eliminated.
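To make this concrete, here is a minimal sketch of what a language model actually computes when prompted. It assumes the Hugging Face transformers library and the small GPT-2 model purely for illustration; any causal language model behaves the same way, reporting which continuations are probable in its training text rather than which are true.

```python
# A minimal sketch of what an LLM computes: a probability distribution over
# the next token, not a judgment about real-world truth. Assumes the Hugging
# Face `transformers` library and the small GPT-2 model for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of Australia is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

# The model reports which continuations were probable in its training text,
# whether or not they are factually correct.
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r:>12}  p={prob.item():.3f}")
```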
That LLMs don’t make predictions about the real world is one of the essential reasons why it’s so hard to get them to do simple, useful tasks. If you ask them questions, they can indeed seem super-human in their knowledge. But once you ask them to do something simple in the real world, they fail.
LLMs can only output text. They therefore must be augmented with tools that can be manipulated by text if they are to accomplish anything in the real world. In a previous post, I documented the difficulty I had getting an LLM to download comment letters from a government website and to summarize them in a table, something an associate in a law firm might be asked to do. I also went over one possible tool chain that might give AI agents, LLMs endowed with tools, the ability to perform some human tasks like the law associate task.
The tool chain I discussed in that post included MCP servers, which give LLMs access to data, prompts, and tools; the A2A framework, which allows AI agents to talk to each other; and NANDA, a framework intended to support an economy of cooperating AI agents. There is no consensus that this tool chain will win. HuggingGPT, which lets an LLM orchestrate HuggingFace models and tools, is a realistic alternative to MCP servers. But Yudkowsky and Soares implicitly assume that the problem of how to give an LLM the ability to interact with the real world has been solved, or that, if it hasn’t, the super-intelligent AI will somehow solve it.
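Whichever framework wins, the basic pattern is the same: the model emits text, and a separate harness has to parse that text, run the requested tool, and feed the result back. Here is a schematic sketch of that loop; call_llm and the two tools are hypothetical stand-ins, not part of MCP, A2A, NANDA, or HuggingGPT.

```python
# Schematic tool-use loop for an "AI agent": the LLM only ever emits text, so
# a harness must parse that text, execute the requested tool, and return the
# result. All names here are hypothetical stand-ins for illustration only.
import json

TOOLS = {
    "download_comment_letters": lambda url: f"[letters fetched from {url}]",
    "summarize_to_table": lambda text: f"[table summarizing {len(text)} chars]",
}

def call_llm(messages):
    # Placeholder for a real model call, stubbed so the loop runs end to end:
    # request one tool, then give a final answer once a tool result is present.
    if not any(m["role"] == "tool" for m in messages):
        return json.dumps({"tool": "download_comment_letters",
                           "argument": "https://example.gov/comment-letters"})
    return "Done: letters downloaded and summarized."

def agent_loop(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        try:
            request = json.loads(reply)
        except json.JSONDecodeError:
            return reply                     # plain text: treat as the final answer
        tool = TOOLS.get(request.get("tool"))
        if tool is None:
            return reply
        messages.append({"role": "tool", "content": tool(request.get("argument"))})
    return "Gave up after max_steps."

print(agent_loop("Summarize the comment letters in a table."))
```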
The authors implicitly assume that other difficult problems have also been solved. Because LLMs hallucinate and have no concept of external truth, they need access to data that humans have vetted as true. There is ongoing research on Retrieval Augmented Generation (RAG), with many competing proposals and software approaches; RAG is one way to give an LLM access to a store of documents that can be queried in natural language. Alternatively, many developers advocate knowledge graphs, which encode not just text but relations between concepts. There are significant challenges in implementing either methodology.
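The shape of a RAG pipeline is simple even if building a good one is not. The sketch below uses a toy embed function in place of a real embedding model and vector database; the point is only that retrieved, human-curated text is what gets handed to the LLM alongside the question.

```python
# A minimal retrieval-augmented generation (RAG) sketch. `embed` is a toy
# stand-in for a real embedding model and vector database; the documents are
# invented for the example.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in: hash characters into a fixed-size unit vector.
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

DOCUMENTS = [
    "Comment letter A argues the proposed rule is too costly.",
    "Comment letter B supports the rule but asks for a longer phase-in.",
    "Comment letter C questions the agency's statutory authority.",
]
DOC_VECTORS = np.stack([embed(d) for d in DOCUMENTS])

def retrieve(question: str, k: int = 2) -> list[str]:
    scores = DOC_VECTORS @ embed(question)   # cosine similarity (unit vectors)
    return [DOCUMENTS[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Which letters object to the proposed rule?"))
```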
Yudkowsky and Soares also frequently equivocate between LLMs and reinforcement learning (RL) models, which are fundamentally different. RL models do make predictions about the real world (or at least an environment) and steer actions toward goals. But reinforcement learning models are not primarily what is being developed and deployed today. There are proposals to combine LLMs and RL models, but the research is preliminary and there are many open problems: RL models don’t generalize well to new environments, for example, and LLM hallucination is another difficulty, since RL models need true data to work effectively.
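For contrast, here is the simplest possible reinforcement learning example, tabular Q-learning on a made-up five-state corridor. The feedback signal is reward from an environment the agent acts in, not the statistics of a text corpus, which is the sense in which RL models “predict and steer” while LLMs do not.

```python
# Tabular Q-learning on a toy five-state corridor, to contrast with next-word
# prediction: the feedback here is reward from an environment the agent acts
# in, not the statistics of a text corpus. Entirely illustrative.
import random

N_STATES, GOAL = 5, 4            # states 0..4, reward only for reaching state 4
ACTIONS = [-1, +1]               # move left or move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action]

for episode in range(500):
    s = 0
    while s != GOAL:
        if random.random() < epsilon:
            a = random.randrange(2)                  # explore
        else:
            a = max((0, 1), key=lambda i: Q[s][i])   # exploit current estimate
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: nudge the estimate toward the observed reward
        # plus the discounted value of the best action in the next state.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([round(max(q), 2) for q in Q])   # learned values rise toward the goal
```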
AI self-improvement
Yudkowsky and Soares argue that an AI can simply improve itself at will by doing further “gradient descent” on its model weights, but no deployed LLM does that, and there are significant obstacles to the idea. Research on whether and how LLMs could improve themselves is still very preliminary, and some of the problems are obvious.
LLMs need data to improve themselves. All of that data comes from humans, and we are running out of it. Creative solutions are being proposed to get more data, such as paying people to have their phone conversations recorded. But Yudkowsky and Soares assume away the data problem: to them, there is an infinite amount of data for LLMs to use in their quest for super-intelligence.
They also make a much more controversial claim: that LLMs can tweak their own architecture, train with gradient descent, and then keep any improvements, thus increasing their cognitive capabilities. How would that work? Even small changes in architecture probably require a completely new training run, requiring enormous computational resources that would produce a new set of model weights. The result would be a completely different LLM, with plans that likely differ from the previous version’s.
The claim that LLMs can bootstrap their way to super-intelligence reveals another underlying assumption in the argument: that LLMs maintain their identity over time. How can LLMs maintain their identity if they must constantly change themselves to gain further abilities? How would their plans remain time-consistent, i.e., stay the same before and after the LLM updates itself? Wouldn’t LLMs instead become the virtual ghosts of Hamlet in the machine, constantly dithering and changing their minds?
Real intelligence requires generalization
Yudkowsky and Soares define super-intelligence as the ability to surpass all humans on all prediction and steering tasks. Implicitly, they assume that the AI generalization problem has somehow been solved. They use “gradient descent” as an incantation that seems to endow the LLM with magical powers to do whatever it needs. But gradient descent is nothing more than a standard mathematical technique for finding parameters in a model (the weights, in the parlance of AI) that minimize prediction error on a particular data set. Nothing about the procedure guarantees that the resulting model generalizes to different data sets.
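For readers who have not seen it spelled out, this is all gradient descent is, shown here in a toy setting: repeatedly nudge parameters in the direction that reduces a loss measured on one fixed data set. The data and learning rate below are made up for illustration.

```python
# Plain gradient descent fitting y = w*x + b to one fixed, synthetic data set,
# to show what the technique is: iteratively nudging parameters to reduce a
# loss measured on that particular data, and nothing more.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)   # the "training set"

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x)   # d(mean squared error) / dw
    grad_b = 2 * np.mean(error)       # d(mean squared error) / db
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))       # recovers roughly (3.0, 0.5)
```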
If you train a model to complete words in sentences in a human language, that training doesn’t easily transfer to other tasks. The knowledge can be transferred to similar tasks, like coding. It can also transfer to mathematics, since mathematics can be represented symbolically, just like language. But if you train an LLM to predict the next word in a human language, that training will not transfer to genome models, such as EVO 2, which are trained on DNA sequences.
LLMs may seem to have solved the knowledge generalization problem, since they can answer questions across so many domains. That is an illusion, however. LLMs can do it because they act as a kind of compression technology: a giant dataset, essentially the sum of all written human knowledge, is compressed into a smaller file, the LLM weights. Compression of information is not the same as generalization of knowledge. An LLM trained on all human language can’t compete with a genome model trained on DNA sequences.
Yudkowsky and Soares make the crucial implicit assumption that the knowledge generalization problem has been solved. That’s just more science fiction.
Do we really need one AI to rule them all?
One more implicit assumption Yudkowsky and Soares make is that AI technology will progress to AGI and then on to super-AGI, which is by no means clear. So far, the big achievements in AI have come from designing specialized, non-AGI models.
LLMs are specialized to language, though they may also be extended to include vision or other capabilities. AlphaFold is a deep learning model specialized to predict the 3D structures of proteins. It leverages some components also used in LLMs, such as transformers, but it is fundamentally a different model. AlphaZero is a specialized deep learning model designed to play games such as chess, shogi, or Go at a superhuman level. Besides neural networks, it employs a search strategy called Monte Carlo Tree Search that concentrates computation on promising game positions. These models can perform spectacular feats without AGI. Meta recently released Code World Model, a small, specialized language model trained to understand how Python code executes.
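As a rough illustration of how that kind of search concentrates on promising positions, here is the UCB-style selection rule at the heart of Monte Carlo Tree Search, in stripped-down form. The node statistics are made up for the example; AlphaZero itself uses a related rule (PUCT) guided by a neural network’s policy and value estimates.

```python
# Stripped-down MCTS-style node selection: pick the child that best balances
# exploitation (high average value so far) against exploration (few visits).
# Toy statistics only; AlphaZero uses a related rule (PUCT) guided by a
# neural network's policy and value outputs.
import math

children = [
    {"move": "e4", "visits": 120, "total_value": 66.0},
    {"move": "d4", "visits": 80,  "total_value": 46.0},
    {"move": "c4", "visits": 10,  "total_value": 6.5},
]
parent_visits = sum(c["visits"] for c in children)

def ucb1(child, c=1.4):
    mean_value = child["total_value"] / child["visits"]               # exploitation
    exploration = c * math.sqrt(math.log(parent_visits) / child["visits"])
    return mean_value + exploration

best = max(children, key=ucb1)
print(best["move"])   # the rarely visited move can win on the exploration bonus
```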
Currently, AI engineers and entrepreneurs are trying to build practical AI agents. For example, ChemCrow is an open source AI agent that integrates 18 chemistry tools with an LLM. The tools include LitSearch, which extracts chemical research findings from academic papers, Name2SMILES, which converts molecule names to SMILES format, and Reaction Safety Checker, which checks for potential hazards.
As I covered in another post, the AI community is actively working on frameworks that would allow specialized AIs that are not endowed with AGI to work together to solve practical problems. While a small number of proprietary AI labs are trying to develop more general and powerful LLMs, the open source and business world is focusing on coordinating specialized models. To unlock productivity, AGI may turn out to be unnecessary, obviating the fear of a disaster scenario.
Bottom Line
The fundamental problem with the book is that it derives alleged risks from an imaginary AI technology and then extrapolates and exaggerates them into a science fiction movie extinction scenario. To summarize:
The book’s extinction scenario is a common action movie plot.
The argument that AIs are dangerous because we can’t know their motives and plans contradicts the claim that we nonetheless know their motives for killing us.
The book warns about the risks of a fictitious AI that aren’t present in current AI technology.
The claim that AI can become super-humanly intelligent by bootstrapping itself has not been shown to be possible and faces very serious difficulties, such as maintaining a persistent identity and a consistent plan over time.
The book assumes that AI can generalize, but that’s still an important, unsolved problem in the field.
Much of the AI community is pursuing specialized non-AGI models that would be endowed with tools and coordinated. If there is a safety risk with AGI, it may not matter since the technology may not be headed for AGI anyway.
In short, the argument for the existential dangers of AI depends on preposterous science fiction scenarios that have nothing to do with how the current technology works. Thus, my P(doom) is still zero.
What does a P(doom) of zero really mean? We can’t, of course, truly estimate probabilities of events that have never happened. P(doom) = 0 just means that we have no basis whatsoever for thinking there is any kind of existential risk brewing from current AI research efforts, and thus we shouldn’t worry about it. But a P(doom) of zero doesn’t mean there are no risks. There are obviously many new risks created by AI technology that must be managed.
If there are any systemic AI risks, the risk that I’d be most worried about is the capacity for AI models to assist with mass surveillance. AI models could also be used for mass propaganda. Obsession with a silly risk distracts from thinking about how to handle the genuine risks.